1. Linux Clusters for High-Performance Computing
- Jim Phillips and Tim Skirvin
- Theoretical and Computational Biophysics
- Beckman Institute
2. HPC vs. High-Availability
- There are two major types of Linux clusters:
- High-Performance Computing
- Multiple computers running a single job for increased performance
- High-Availability
- Multiple computers running the same job for increased reliability
- We will be talking about the former!
3. Why Clusters?
- Cheap alternative to big iron
- Local development platform for big iron code
- Built to task (buy only what you need)
- Built from COTS components
- Runs COTS software (Linux/MPI)
- Lower yearly maintenance costs
- Single failure does not take down entire facility
- Re-deploy as desktops or throw away
4. Why Not Clusters?
- Non-parallelizable or tightly coupled application
- Cost of porting large existing codebase too high
- No source code for application
- No local expertise (don't know Unix)
- No vendor hand holding
- Massive I/O or memory requirements
5. Know Your Users
- Who are you building the cluster for?
- Yourself and two grad students?
- Yourself and twenty grad students?
- Your entire department or university?
- Are they clueless, competitive, or malicious?
- How will you allocate resources among them?
- Will they expect an existing infrastructure?
- How well will they tolerate system downtimes?
6. Your Users' Goals
- Do you want increased throughput?
- Large number of queued serial jobs.
- Standard applications, no changes needed.
- Or decreased turnaround time?
- Small number of highly parallel jobs.
- Parallelized applications, changes required.
7. Your Application
- The best benchmark for making decisions is your application running your dataset.
- Designing a cluster is about trade-offs.
- Your application determines your choices.
- No supercomputer runs everything well either.
- Never buy hardware until the application is
parallelized, ported, tested, and debugged.
8. Your Application: Parallel Performance
- How much memory per node?
- How would it scale on an ideal machine?
- How is scaling affected by (see the ping-pong sketch after this list)
- Latency (time needed for small messages)?
- Bandwidth (time per byte for large messages)?
- Multiprocessor nodes?
- How fast do you need to run?
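- The latency and bandwidth questions above are usually answered with a ping-pong microbenchmark between two nodes. A minimal sketch in C with MPI follows; it is illustrative only (the message size and repetition count are arbitrary choices, not numbers from this talk) and assumes an MPI installation with at least two ranks placed on separate nodes.

    /* Rough MPI ping-pong sketch: time round trips between ranks 0 and 1.
       Small messages estimate latency; large messages estimate bandwidth.
       Illustrative only -- run with at least 2 ranks, one per node. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int rank;
        int reps = 1000;                              /* arbitrary repetition count */
        int bytes = (argc > 1) ? atoi(argv[1]) : 1;   /* 1 byte probes latency */
        char *buf = calloc(bytes, 1);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double rtt = (MPI_Wtime() - t0) / reps;       /* average round-trip time */

        if (rank == 0) {
            printf("one-way latency ~ %.1f us\n", 0.5e6 * rtt);
            printf("bandwidth       ~ %.1f MB/s\n", 2.0 * bytes / rtt / 1e6);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }

- Run it once with a 1-byte message and once with a large (e.g. 1 MB) message to get rough latency and bandwidth figures for a candidate interconnect.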
9. Budget
- Figure out how much money you have to spend.
- Don't spend money on problems you won't have.
- Design the system to just run your application.
- Never solve problems you can't afford to have.
- Fast network on 20 nodes or slower on 100?
- Don't buy the hardware until
- The application is ported, tested, and debugged.
- The science is ready to run.
10. Environment
- The cluster needs somewhere to live.
- You won't want it in your office.
- Not even in your grad student's office.
- Cluster needs
- Space (keep the fire marshal happy).
- Power
- Cooling
11. Environment: Power
- Make sure you have enough power.
- Kill-A-Watt
- $30 at ThinkGeek
- 1.3 GHz Athlon draws 183 VA at full load
- Newer systems draw more; measure for yourself!
- More efficient power supplies help
- Wall circuits typically supply about 20 Amps
- Around 12 PCs @ 183 VA max (8-10 for safety)
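- As a rough worked example (assuming a 110-120 V wall circuit, the usual US case; check your own building):
    20 A x 110-120 V = roughly 2,200-2,400 VA per circuit
    2,200 VA / 183 VA per PC = about 12 PCs at the absolute maximum
    Derating to ~80% of the rating for continuous load gives the 8-10 PCs-per-circuit figure.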
12. Environment: Power Factor
- More efficient power supplies do help!
- Always test your power under load.
- W = V x A x PF
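- A short worked example of why power factor matters (the 150 W load and PF values below are illustrative, not measurements from this talk): circuits and UPSes are rated in VA, so a node that needs 150 W of real power draws
    150 W / 0.70 PF = ~214 VA with an old, uncorrected supply
    150 W / 0.98 PF = ~153 VA with a power-factor-corrected supply
  so the better supply lets noticeably more nodes share the same circuit or UPS.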
13. Environment: Uninterruptible Power
- 5 kVA UPS ($3,000)
- Holds 24 PCs @ 183 VA (safely; see the arithmetic check below)
- Will need to work out building power to them
- May not need UPS for all systems, just root node
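- Quick arithmetic check on the 24-PC figure: 24 x 183 VA = 4,392 VA, just under the UPS's 5,000 VA rating (assuming the nominal rating is usable continuously; follow the manufacturer's own derating guidance).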
14. Environment: Cooling
- Building AC will only get you so far
- Make sure you have enough cooling.
- One PC @ 183 VA puts out roughly 600 BTU/hr of heat.
- 1 ton of AC = 12,000 BTU/hr, or about 3,500 Watts
- Can run 20 CPUs per ton of AC
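- The 20-per-ton figure follows directly, assuming the full electrical load ends up as heat (1 W is about 3.4 BTU/hr):
    183 VA x ~3.4 BTU/hr per Watt = roughly 600 BTU/hr per PC
    12,000 BTU/hr per ton / 600 BTU/hr per PC = ~20 PCs per ton of AC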
15. Hardware
- Many important decisions to make
- Keep application performance, users, environment, local expertise, and budget in mind.
- An exercise in systems integration: making many separate components work well as a unit.
- A reliable but slightly slower cluster is better than a fast but non-functioning cluster.
- Always benchmark a demo system first!
16. Hardware: Networking
- Two main options
- Gigabit Ethernet: cheap ($100-200/node), universally supported and tested, cheap commodity switches up to 48 ports.
- 24-port switches seem the best bang-for-buck
- Special interconnects
- Myrinet: very expensive (thousands per node), very low latency, logarithmic cost model for very large clusters.
- InfiniBand: similar, less common, not as well supported.
17. Hardware: Other Components
- Filtered Power (Isobar, Data Shield, etc.)
- Network Cables: buy good ones, you'll save debugging time later
- If a cable is at all questionable, throw it away!
- Power Cables
- Monitor
- Video/Keyboard Cables
18. User Rules of Thumb
- 1-4 users
- Yes, you still want a queueing system.
- Plan ahead to avoid idle time and conflicts.
- 5-20 users
- Put one person in charge of running things.
- Work out a fair-share or reservation system.
- > 20 users
- User documentation and examples are essential.
- Decide who makes resource allocation decisions.
19. Application Rules of Thumb
- 1-2 programs
- Don't pay for anything you won't use.
- Benchmark, benchmark, benchmark!
- Be sure to use your typical data.
- Try different compilers and compiler options.
- > 2 programs
- Select the most standard OS environment.
- Benchmark those that will run the most.
- Consider a specialized cluster for dominant apps
only.
20. Parallelization Rules of Thumb
- Throughput is easy: the app runs as is.
- Turnaround is not.
- Parallel speedup is limited by (see the formula after this list)
- Time spent in non-parallel code.
- Time spent waiting for data from the network.
- Improve serial performance first
- Profile to find most time-consuming functions.
- Try new algorithms, libraries, hand tuning.
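- The first limit above is Amdahl's law. A minimal statement, writing s for the fraction of runtime that stays serial and N for the number of processors:
    speedup(N) = 1 / (s + (1 - s)/N), which can never exceed 1/s
- Example: with s = 0.10 (10% serial), speedup tops out at 10x no matter how many nodes you buy.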
21. Some Details Matter More
- What limiting factor do you hit first?
- Budget?
- Space, power, and cooling?
- Network speed?
- Memory speed?
- Processor speed?
- Expertise?
22. Limited by Budget
- Don't waste money solving problems you can't afford to have right now.
- Regular PCs on shelves (rolling carts)
- Gigabit networking and multiple jobs
- Benchmark performance per dollar.
- The last dollar you spend should be on whatever improves your performance.
- Ask for equipment funds in proposals!
23. Limited by Space
- Benchmark performance per rack
- Consider all combinations of
- Rackmount nodes
- More expensive but no performance loss
- Dual-processor nodes
- Less memory bandwidth per processor
- Dual-core processors
- Less memory bandwidth per core
24. Limited by Power/Cooling
- Benchmark performance per Watt
- Consider
- Opteron or PowerPC rather than Xeon
- Dual-processor nodes
- Dual-core processors
25. Limited by Network Speed
- Benchmark your code at NCSA.
- 10,000 CPU-hours is easy to get.
- Try running one process per node.
- If that works, buy single-processor nodes.
- Try Myrinet.
- If that works, can you run at NCSA?
- Can you run more, smaller jobs?
26. Limited by Serial Performance
- Is it memory performance? Try
- Single-core Opterons
- Single-processor nodes
- Larger cache CPUs
- Lower clock speed CPUs
- Is it really the processor itself? Try
- Higher clock speed CPUs
- Dual-core CPUs
27. Limited by Expertise
- There is no substitute for a local expert.
- Qualifications
- Comfortable with the Unix command line.
- Comfortable with Linux administration.
- Cluster experience if you can get it.
28. System Software
- Linux is just a starting point.
- Operating system
- Libraries: message passing, numerical
- Compilers
- Queuing Systems
- Performance
- Stability
- System security
- Existing infrastructure considerations
29. Scyld Beowulf / Clustermatic
- Single front-end master node
- Fully operational normal Linux installation.
- Bproc patches incorporate slave nodes.
- Severely restricted slave nodes
- Minimum installation, downloaded at boot.
- No daemons, users, logins, scripts, etc.
- No access to NFS servers except for master.
- Highly secure slave nodes as a result
30. Oscar/ROCKS
- Each node is a full Linux install
- Offers access to a file system.
- Software tools help manage these large numbers of machines.
- Still more complicated than only maintaining one master node.
- Better suited for running multiple jobs on a single cluster, vs. one job on the whole cluster.
31. System Software: Compilers
- No point in buying fast hardware just to run poorly performing executables
- Good compilers might provide a 50-150% performance improvement
- May be cheaper to buy a $2,500 compiler license than to buy more compute nodes
- Benchmark your real application with the compiler; get an eval compiler license if necessary
32. System Software: Message Passing Libraries
- Usually dictated by application code
- Choose something that will work well with hardware, OS, and application
- User-space message passing?
- MPI: industry standard, many implementations by many vendors, as well as several free implementations
- Others: Charm++, BIP, Fast Messages
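- To make the portability point concrete, a minimal MPI program is sketched below (not from this talk); the same standard MPI source builds unchanged against free implementations such as MPICH or LAM/MPI, or a vendor MPI.

    /* Minimal MPI example: each rank reports in, then all ranks sum their IDs. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, sum;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("hello from rank %d of %d\n", rank, size);

        /* Collective operation: add up all rank numbers, result on every node. */
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of ranks = %d\n", sum);

        MPI_Finalize();
        return 0;
    }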
33. System Software: Numerical Libraries
- Can provide a huge performance boost over Numerical Recipes or in-house routines
- Typically hand-optimized for each platform
- When applications spend a large fraction of runtime in library code, it pays to buy a license for a highly tuned library
- Examples: BLAS, FFTW, Interval libraries
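- As a sketch of why a tuned library pays off, the same matrix multiply is shown below as a naive triple loop and as one call into a BLAS with the C interface (e.g. ATLAS); the 1000x1000 size is an arbitrary illustration, not a benchmark from this talk.

    /* Naive triple loop vs. tuned BLAS dgemm for C = A * B (n x n doubles). */
    #include <cblas.h>   /* assumes a BLAS providing the CBLAS interface */
    #include <stdlib.h>

    void naive_dgemm(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += A[i * n + k] * B[k * n + j];
                C[i * n + j] = s;
            }
    }

    int main(void) {
        int n = 1000;
        double *A = calloc((size_t)n * n, sizeof(double));
        double *B = calloc((size_t)n * n, sizeof(double));
        double *C = calloc((size_t)n * n, sizeof(double));

        naive_dgemm(n, A, B, C);   /* hand-written version */

        /* Same operation via the tuned library; typically several times faster. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);

        free(A); free(B); free(C);
        return 0;
    }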
34. System Software: Batch Queueing
- Clusters, although cheaper than big iron, are still expensive, so they should be efficiently utilized
- The use of a batch queueing system can keep a cluster running jobs 24/7
- Things to consider
- Allocation of sub-clusters?
- 1-CPU jobs on SMP nodes?
- Examples: Sun Grid Engine, PBS, LoadLeveler
35. System Software: Operating System
- Any annoying management or reliability issues get hugely multiplied in a cluster environment.
- Plan for security from the outset.
- Clusters have special needs; use something appropriate for the application and hardware.
36. System Software: Install It Yourself
- Don't use the vendor's pre-loaded OS.
- They would love to sell you 100 licenses.
- What happens when you have to reinstall?
- Do you like talking to tech support?
- Are those flashy graphics really useful?
- How many security holes are there?
37. Security Tips
- Restrict physical access to the cluster, if possible.
- Make sure you're involved in all tours, to make sure nobody touches anything.
- If you're on campus, put your clusters into the Fully Closed network group
- Might cause some limitations if you're trying to submit from off-site
- Will cause problems with GLOBUS
- The built-in firewall is your friend!
38. Purchasing Tips: Before You Begin
- Get your budget
- Work out the space, power, and cooling capacities of the room.
- Start talking to vendors early
- But don't commit!
- Don't fall in love with any one vendor until you've looked at them all.
39. Purchasing Tips: Design Notes
- Make sure to order some spare nodes
- Serial nodes and hot-swap spares
- Keep them running to make sure they work.
- If possible, install HDs only in the head node
- State law and UIUC policy require all hard drives to be wiped before disposal.
- It doesn't matter if the drive never stored anything!
- Each drive will take 8-10 hours to wipe.
- Save yourself a world of pain in a few years
- or just give your machines to some other campus
group, and make them worry about it.
40. Purchasing Tips: Get Local Service
- If a node dies, do you want to ship it?
- Two choices
- Local business (Champaign Computer)
- Major vendor (Sun)
- Ask others about responsiveness.
- Design your cluster so that you can still run
jobs if a couple of nodes are down.
41. Purchasing Tips: Dealing with Purchasing
- You will want to put the cluster order on a Purchase Order (PO)
- Do not pay for the cluster until it entirely works.
- Prepare a ten-point letter
- Necessary for all purchases > $25k.
- Examples are available from your business office (or bug us for our examples).
- These aren't difficult to write, but will probably be necessary.
42. Purchasing Tips: The Bid Process
- Any purchase > $28k must go up for bid
- Exception: sole-source vendors
- The number grows every year
- Adds a month or so to the purchase time
- If you can keep the numbers below the magic $28k, do it!
- The bid limit may be leverage for vendors to drop their prices just below the limit; plan accordingly.
- You will get lots of junk bids
- Be very specific about your requirements to keep them away!
43. Purchasing Tips: Working the Bid Process
- Use sole-source vendors where possible.
- This is a major reason why we buy from Sun.
- Check with your purchasing people.
- This won't help you get around the month-long delay, as the item still has to be posted.
- Purchase your clusters in small chunks
- Only works if you're looking at a relatively small cluster.
- Again, you may be able to use this as leverage with your vendor to lower their prices.
44. Purchasing Tips: Receiving Your Equipment
- Let Receiving know that the machines are coming.
- They will take up a lot of space on the loading dock.
- Working with them to save space will earn you good will (and faster turnaround).
- Take your machines out of Receiving's space as soon as reasonably possible.
45. Purchasing Tips: Consolidated Inventory
- Try to convince your Inventory workers to tag each cluster, and not each machine
- It's really going to be running as a cluster anyway (right?).
- This will make life easier on you.
- Repairs are easier when you don't have to worry about inventory stickers.
- This will make life easier for them.
- 3 items to track instead of 72
46. Purchasing Tips: Assembly
- Get extra help for assembly
- It's reasonably fun work
- as long as the assembly line goes fast.
- Demand pizza.
- Test the assembly instructions before you begin
- Nothing is more annoying than having to realign all of the rails after they're all screwed in.
47. Purchasing Tips: Testing and Benchmarking
- Test the cluster before you put it into production!
- Sample jobs, cpuburn
- Look at power consumption
- Test for dead nodes
- Remember: vendors make mistakes!
- Even their demo applications may not work; check for yourself.
48. Case Studies
- The best way to illustrate cluster design is to look at how somebody else has done it.
- The TCB Group has designed four separate Linux clusters in the last six years.
49. 2001 Case Study
- Users
- Many researchers with MD simulations
- Need to supplement time on supercomputers
- Application
- Not memory-bound, runs well on IA32
- Scales to 32 CPUs with 100Mbps Ethernet
- Scales to 100 CPUs with Myrinet
50. 2001 Case Study 2
- Budget
- Initially $20K, eventually grew to $100K
- Environment
- Full machine room, slowly clear out space
- Under-utilized 12kVA UPS, staff electrician
- 3 ton chilled water air conditioner (Liebert)
51. 2001 Case Study 3
- Hardware
- Fastest AMD Athlon CPUs available (1333 MHz).
- Fast CL2 SDRAM, but not DDR.
- Switched 100Mbps Ethernet, Intel EEPro cards.
- Small 40 GB hard drives and CD-ROMs.
- System Software
- Scyld clusters of 32 machines, 1 job/cluster.
- Existing DQS, NIS, NFS, etc. infrastructure.
52. 2003 Case Study
- What changed since 2001
- 50% increase in processor speed
- 50% increase in NAMD serial performance
- Improved stability of SMP Linux kernel
- Inexpensive gigabit cards and 24-port switches
- Nearly full machine room and power supply
- Popularity of compact form factor cases
- Emphasis on interactive MD of small systems
53. 2003 Case Study 2
- Budget
- Initially $65K, eventually grew to $100K
- Environment
- Same general machine room environment
- Additional machine room space is available in the server room
- Just switched to using rack-mount equipment
- Still using the old clusters; don't want to get rid of them entirely
- Need to be more space-conscious
54. 2003 Case Study 3
- Option 1
- Single processor, small form factor nodes.
- Hyperthreaded Pentium 4 processors.
- 32 bit 33 MHz gigabit network cards.
- 24 port gigabit switch (24-processor clusters).
- Problems
- No ECC memory.
- Limited network performance.
- Too small for next-generation video cards.
55. 2003 Case Study 4
- Final decision
- Dual Athlon MP 2600+ in normal cases.
- No hard drives or CD-ROMs.
- 64 bit 66 MHz gigabit network cards.
- 24 port gigabit switch (48-proc clusters).
- Clustermatic OS, boot slaves off of floppy.
- Floppies have proven very unreliable, especially when left in the drives.
- Benefits
- Server class hardware w/ ECC memory.
- Maximum processor count for large simulations.
- Maximum network bandwidth for small simulations.
56. 2003 Case Study 5
- Athlon clusters from 2001 recycled
- 36 nodes outfitted as desktops
- Added video cards, hard drives, extra RAM
- Cost: $300/machine
- Now dead or in 16-node Condor test cluster
- 32 nodes donated to another group
- Remaining nodes move to server room
- 16-node Clustermatic cluster (used by guests)
- 12 spares and build/test boxes for developers
57. 2004 Case Study
- What changed since 2003
- Technologically, not much!
- Space is more of an issue.
- A new machine room has been built for us.
- Vendors are desperate to sell systems at any
price.
58. 2004 Case Study 2
- Budget
- Initially $130K, eventually grew to $180K
- Environment
- New machine room will store the new clusters.
- Two five-ton Liebert air conditioners have been installed.
- There is minimal floor space, enough for four racks of equipment.
59. 2004 Case Study 3
- Final decision
- 72x Sun V60x rack-mount servers.
- Dual 3.06 GHz Intel processors (only slightly faster)
- 2 GB RAM, dual 36 GB HDs, DVD-ROM included in the deal
- Network-bootable gigabit Ethernet built in
- Significantly more stable than any old cluster machine
- 3x 24-port gigabit switches (3x 48-processor clusters)
- 6x serial nodes (identical to the above, also serve as spares)
- Sun Rack 900-38
- 26 systems per rack, plus switch and UPS for head nodes
- Clustermatic 4 on Red Hat 9
60. 2004 Case Study 4
- Benefits
- Improved stability over old clusters.
- Management is significantly easier with Sun servers than PC whiteboxes.
- Network booting of slaves allows lights-off management.
- Systems use up minimal floor space.
- Similar performance to 2003 allows all 6 clusters (3 old + 3 new) to take jobs from a single queue.
- Less likely to run out of memory when running an express queue job.
- Complete machines are easily retasked.
61. For More Information
- http://www.ks.uiuc.edu/Development/Computers/Cluster/
- http://www.ks.uiuc.edu/Training/Workshop/Clusters/
- We will be setting up a Clusters mailing list some time in the next week or two.
- We will also be setting up a Clusters User Group shortly, but that will take some more effort.