Title: Clusters in Molecular Sciences Applications
1Clusters in Molecular Sciences Applications
Serguei Patchkovskii*, Rochus Schmid*, Tom Ziegler*, Siu Pang Chan, Andrew McCormack, Roger Rousseau, Ian Skanes
*Department of Chemistry, University of Calgary, 2500 University Dr. NW, Calgary, Alberta, T2N 1N4 Canada
Theory and Computation Group, SIMS, NRC, 100 Sussex Dr., Ottawa, Ontario, K1A 0R6
2Overview
- Beowulf-style clusters entered mainstream
- Are clusters a lasting, efficient investment?
- Odysseus: an internal cluster at the SIMS theory group
- Clusters in molecular science applications: software availability and performance
- Three war stories, and a cautionary message
- Summary and conclusions
3Shared, Academic Clusters in Canada
Location               CPUs            URL or other info
Carleton U.            8xPII-400       www.scs.carleton.ca/gis/
UBC                    256xPIII-1000   www.gdcfd.ubc.ca/Monster
U of Calgary           179xAlpha       www.maci-cluster.ucalgary.ca
U of Western Ontario   144xAlpha       GreatWhite.sharcnet.ca
U of Western Ontario   48xAlpha        DeepPurple.sharcnet.ca
McMaster U             106xAlpha       Idra.physics.mcmaster.ca
U of Guelph            120xAlpha       Hammerhead.uoguelph.ca
U of Windsor           8xAlpha
Wilfrid Laurier U      8xAlpha
4Canadian top-500 facilities
Cluster
5Internal, workhorse clusters
Location                              CPUs             URL or other
U of Alberta                          98xPIII-450      www.phys.ualberta.ca/THOR
U of Calgary                          94x21164-500     www.cobalt.chem.ucalgary.ca
U of Calgary                          120xPIII-1000    www.ucalgary.ca/tieleman/elk.html
U of Calgary                          32xPIII
Memorial U                            32xPII-300       weland.esd.mun.ca
MDS Proteomics                        400xPIII-1000    www.mdsproteomics.com
ICPET, NRC                            80xPIII-800
DRAO, NRC                             16xPII-450
SIMS, NRC                             32xPIII-933
Samuel Lunenfeld Research Institute   224xPIII-450     Bioinfo.mshri.on.ca/yac/
Sherbrooke U                          64xPII-400
U of Saskatchewan                     12xAthlon-800    Sasquatch.usask.ca
Simon Fraser U                        16xPIII-500      www.sfu.ca/acs/cluster/
U of Victoria                         39xPIII-450      Pingu.phys.uvic.ca/muse/ (?)
McMaster U                            32xPIII-700      www.cim.mcgill.ca/cvr/beowulf/
CERCA, Montreal                       16xAthlon-1200   www.cerca.umontreal.ca/fourmano/
U of Western Ontario                  various          www.baldric.uwo.ca
6Clusters are everywhere
- Lemma 1: A computationally-intensive research group in Canada can be in one of three states:
- It owns a cluster, or
- It builds a cluster, or
- It plans to build a cluster RSN
7Cobalt Hardware
Computers on benches all linked together
8Cobalt Nodes and Network
Nodes: Digital/Compaq Personal Workstation 500au
- CPU: Alpha 21164A, 500 MHz
- Cache: 96Kb on-chip (L1 and L2)
- Peak flops: 10^9 Flop/second
- SpecInt 95: 15.7 (estimate)
- SpecFP 95: 19.5 (estimate)
Network: 4 x 3COM SuperStack II 3300
- Peak aggregate b/w: 500.0 MB/s
- Peak internode b/w (TCP): 11.2 MB/s
- NFS read/write: 3.4/4.1 MB/s
- Round-trip (TCP): 360 µs
- Round-trip (UDP): 354 µs
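The latency and bandwidth figures above can be combined into a rough estimate of message cost. The sketch below uses the common "latency plus size over bandwidth" model, which is an assumption rather than anything measured on Cobalt. It also assumes that one-way latency is half the TCP round-trip time and that 1 MB means 10^6 bytes.

```python
# Rough message-cost model: time = latency + size / bandwidth.
# Figures are the TCP numbers quoted for Cobalt; the model itself,
# the halving of the round trip, and 1 MB = 10^6 bytes are assumptions.

TCP_ROUND_TRIP_S = 360e-6                  # 360 microseconds
TCP_BANDWIDTH_BYTES_PER_S = 11.2e6         # 11.2 MB/s
ONE_WAY_LATENCY_S = TCP_ROUND_TRIP_S / 2   # assumed symmetric link


def transfer_time(n_bytes, latency_s=ONE_WAY_LATENCY_S,
                  bandwidth=TCP_BANDWIDTH_BYTES_PER_S):
    """Estimated time in seconds to deliver one message of n_bytes."""
    return latency_s + n_bytes / bandwidth


def half_bandwidth_size(latency_s=ONE_WAY_LATENCY_S,
                        bandwidth=TCP_BANDWIDTH_BYTES_PER_S):
    """Message size at which latency and transmission time are equal."""
    return latency_s * bandwidth


if __name__ == "__main__":
    for size in (64, 2048, 1_000_000):
        print(f"{size:>9d} bytes: {transfer_time(size) * 1e6:10.1f} microseconds")
    print(f"half-bandwidth message size: {half_bandwidth_size():.0f} bytes")
```

Under these assumptions, messages much smaller than about 2 Kbytes are dominated by latency, which is one reason fine-grained parallel codes scale poorly on Fast Ethernet.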
9Cobalt Software
- OS, communications, and cluster management
- Base OS: Tru64, using DMS, NIS, and NFS
- Compilers: Digital/Compaq C, C++, Fortran
- Communications: PVM, MPICH
- Batch queuing: DQS
- Application software
- ADF: Amsterdam Density Functional (PVM)
- PAW: Projector-Augmented Wave (MPI)
10Cobalt Return on the Investment
- Payback: research articles
Total publications        92
including:
  Organometallics         21
  J. Am. Chem. Soc.       12
  J. Phys. Chem.          11
  J. Chem. Phys.          10
  Inorg. Chem.            6
Total cost                $390,800
including:
  Initial purchase        $346,000
  Operating (98-01):
    power (6 cents/kWh)   $15,800
    admin (20% PDF)       $24,000
    spare parts           $5,000
ROI: 1 publication / $4,250
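The return-on-investment figure follows directly from the numbers above. A minimal sketch of the arithmetic, using only the cost items and publication count quoted on this slide (the variable names are illustrative):

```python
# Cost per publication for Cobalt, from the figures quoted on the slide.
costs = {
    "initial purchase": 346_000,
    "power": 15_800,
    "admin": 24_000,
    "spare parts": 5_000,
}
publications = 92

total_cost = sum(costs.values())
cost_per_publication = total_cost / publications

print(f"total cost:           {total_cost:,d}")
print(f"cost per publication: {cost_per_publication:,.0f}")
```

The exact quotient is slightly below the rounded value shown on the slide.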
11Odysseus: Low-tech solution for high-tech problems (1)
12Odysseus: Low-tech solution for high-tech problems (2)
- Nodes (16+1)
- ABIT VP6 motherboard
- 2xPIII-933, 133MHz FSB
- 4x256Mbytes RAM
- 3COM 3C905C
- 36Gbytes 7200rpm IDE
- plus, on the front end
- Intel PRO/1000
- Adaptec AHA-2940UW
- 60Gbytes 7200rpm IDE
13Odysseus: Low-tech solution for high-tech problems (3)
- Network: SCI + 100Mbit
- Dolphin D339 (2D SCI)
- H ring
- V ring
- HP Procurve 2524 + 1Gig
14Odysseus: Low-tech solution for high-tech problems (4)
- Backup unit
- VXAtape (www.ecrix.com)
- 35Gbytes/cartridge (physical)
- TreeFrog autoloader (www.spectralogic.com)
- 16 cartridge capacity
- UPS Unit
- Powerware 5119
- 2880VA
15Odysseus: Low-tech solution for high-tech problems (5)
Odysseus at a glance
Processors    32 (+2)
Memory        16 Gbytes
Disk          636 Gbytes
Peak flops    29.9 GFlop/s
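The at-a-glance totals can be reproduced from the per-node specification. The sketch below assumes 16 compute nodes plus one front end, and one floating-point operation per clock cycle per CPU; both are readings of the slides that happen to match the quoted totals, not stated facts. The front end's processors and memory are left out, since only its extra disk is itemized.

```python
# Reconstructing "Odysseus at a glance" from the per-node hardware list.
COMPUTE_NODES = 16          # assumed: 16 compute nodes plus a front end
CPUS_PER_NODE = 2           # 2xPIII-933
CPU_CLOCK_GHZ = 0.933
FLOPS_PER_CYCLE = 1         # assumed; chosen because it matches the slide
RAM_PER_NODE_MBYTES = 4 * 256
DISK_PER_NODE_GBYTES = 36
FRONT_END_EXTRA_DISK_GBYTES = 60

processors = COMPUTE_NODES * CPUS_PER_NODE
memory_gbytes = COMPUTE_NODES * RAM_PER_NODE_MBYTES // 1024
disk_gbytes = COMPUTE_NODES * DISK_PER_NODE_GBYTES + FRONT_END_EXTRA_DISK_GBYTES
peak_gflops = processors * CPU_CLOCK_GHZ * FLOPS_PER_CYCLE

print(f"Processors: {processors}")
print(f"Memory:     {memory_gbytes} Gbytes")
print(f"Disk:       {disk_gbytes} Gbytes")
print(f"Peak flops: {peak_gflops:.1f} GFlop/s")
```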
16Odysseus cost overview
Expense                                         Dollars
Nodes                                           40,640
SCI network (cards + cables)                    26,771
Backup unit (tape + robot)                      5,860
Spare parts in stock                            5,024
Ethernet (switch, cables, and head node link)   4,190
Compiler (PGI)                                  3,780
UPS                                             2,265
Backup tapes (16+1)                             1,911
Total                                           90,441
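To see where the money went, the expense items can be summed and expressed as shares of the total. A small sketch using the figures from the table above (the shortened item labels are illustrative):

```python
# Odysseus expenses in dollars, as listed in the cost overview.
expenses = {
    "Nodes": 40_640,
    "SCI network": 26_771,
    "Backup unit": 5_860,
    "Spare parts in stock": 5_024,
    "Ethernet": 4_190,
    "Compiler (PGI)": 3_780,
    "UPS": 2_265,
    "Backup tapes": 1_911,
}

total = sum(expenses.values())
shares = {item: 100.0 * cost / total for item, cost in expenses.items()}

# Print items from most to least expensive, with their share of the total.
for item, cost in sorted(expenses.items(), key=lambda kv: -kv[1]):
    print(f"{item:22s} {cost:>7,d} {shares[item]:5.1f}%")
print(f"{'Total':22s} {total:>7,d}")
```

The nodes and the SCI interconnect together account for roughly three quarters of the budget.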
17Clusters in molecular science: software availability
- Gaussian
- Turbomole
- GAMESS
- NWChem
- GROMOS
- ADF
- PAW
- CPMD
- AMBER
- VASP
- PWSCF
- ABINIT
18Software ADF
- ADF: Amsterdam Density Functional (www.scm.com)
- Example: Cr(N)Porph
- Full geometry optimization
- 38 atoms
- 580 basis functions
- C4v symmetry
- 45 Mbytes of memory
- Serial time: 683 minutes
19Software PAW
- PAW: Projector-Augmented Wave (www.pt.tu-clausthal.de/ptpb/PAW/pawmain.html)
- Example: SN2 reaction CH3I + Rh(CO)2I2-
- 11 Å unit cell
- Serial time per step: 83 seconds
- Memory: 231 Mbytes
20Software CPMD
- CPMD: Car-Parrinello Molecular Dynamics (www.mpi-stuttgart.mpg.de/parinello/)
- Example: H in Si64
- 65 atoms, periodic
- 40 Ryd cut-off
- Geometry opt (2 steps) + free MD (70 steps)
[Plot legend: odysseus]
21Software AMBER
- AMBER: Assisted Model Building with Energy Refinement (www.amber.ucsf.edu/amber/)
- Example: 22-residue polypeptide + 4K+ + 2500 H2O
- 1 ns MD
[Plot axes: Time (hour) vs. Ncpu]
22Software VASP
- VASP: Vienna Ab-initio Simulation Package (cms.mpi.univie.ac.at/vasp/)
- Example: Li198, 1000 GPa
- 300 eV cutoff
- 9 K-points
- 10 WF optimization steps + stress tensor
[Plot legend: odysseus]
23Software PWSCF
- PWSCF and PHONON: plane wave pseudopotential codes, optimized for phonon spectra calculations (www.pwscf.org/)
- Example: MgB2 solid
- Geometry opt.
- 40 Ryd cut-off
- 60 K-points
[Plot legend: odysseus]
24Software ABINIT
- ABINIT (www.mapr.ucl.ac.be/ABINIT/)
- Example: SiO2 (stishovite)
- 70 Ryd cut-off
- 6 K-points
- 12 SCF iterations
25War Story 1
- Odysseus hardware maintenance log, Oct 19, 2001
- Overnight, node 6 had a kernel OOPS: it responds to network pings and keyboard, but no new processes can be started
- Reason: Heat sink on CPU1 became loose, resulting in overheating under heavy load
- Resolution: Reinstall the heat sink
- Detected by: Elevated temperature readings for CPU1 (lm_sensors)
- Downtime: 20 minutes (the affected node)
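An overheating CPU like this one can be caught automatically with a periodic temperature sweep. The sketch below is an illustration only: it reads the `hwmon` sysfs interface of present-day Linux kernels (temperatures in millidegrees Celsius) rather than the lm_sensors setup used on Odysseus in 2001. The function name and the 70 C limit are hypothetical choices.

```python
from pathlib import Path


def overheated_sensors(hwmon_root="/sys/class/hwmon", limit_celsius=70.0):
    """Return (sensor file, temperature in C) pairs that exceed the limit.

    The 70 C default is an arbitrary illustrative threshold, not a
    recommendation for any particular CPU.
    """
    hot = []
    for sensor in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            # hwmon reports temperatures in millidegrees Celsius.
            celsius = int(sensor.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # unreadable sensor: skip it rather than abort the sweep
        if celsius > limit_celsius:
            hot.append((str(sensor), celsius))
    return hot


if __name__ == "__main__":
    for name, celsius in overheated_sensors():
        print(f"WARNING: {name} reads {celsius:.1f} C")
```

Run from cron on every node, a check of this kind turns a loose heat sink into an e-mail instead of a kernel OOPS.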
26War Story 2
- Odysseus hardware maintenance log, Nov 12, 2001
- A large, 16-CPU VASP job fails with "LAPACK: Routine ZPOTRF failed", or produces a random total energy
- Reason: DIMM in bank 0 on node 17 developed a single-bit failure at the address 0xfd9f0c
- Resolution: Replace memory module in bank 0
- Detected by: Rerunning the failing job with different sets of nodes, followed by the memory diagnostic on the affected node (memtest32)
- Downtime: 1 day (the whole cluster) + 2 days (the affected node)
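Rerunning a failing job on different sets of nodes amounts to a bisection search. The sketch below shows the idea under two strong assumptions: exactly one node is faulty, and a job that includes the faulty node fails every time (the real failures here were partly random, so several reruns per step may be needed). `job_passes` is a hypothetical callback that runs the test job on the given nodes and returns True on success.

```python
def find_bad_node(nodes, job_passes):
    """Bisect a node list down to the single node that makes jobs fail.

    nodes      -- list of node identifiers to search
    job_passes -- hypothetical callback: runs the test job on a list of
                  nodes and returns True if the job completed correctly
    Returns the faulty node, or None if the job passes on all nodes.
    """
    suspects = list(nodes)
    if job_passes(suspects):
        return None  # nothing reproducible to chase
    while len(suspects) > 1:
        middle = len(suspects) // 2
        first_half = suspects[:middle]
        if job_passes(first_half):
            suspects = suspects[middle:]   # fault must be in the other half
        else:
            suspects = first_half
    return suspects[0]


if __name__ == "__main__":
    # Simulated cluster in which node 17 corrupts every job it takes part in.
    bad = find_bad_node(list(range(1, 18)), lambda subset: 17 not in subset)
    print(f"faulty node: {bad}")
```

Once the search narrows to a single node, a memory diagnostic on that node confirms the diagnosis.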
27War Story 3
- Odysseus hardware maintenance log, Dec 10, 2001
- Apparently random application failures are observed
- Reason: Multiple single-bit memory failures, on the nodes (bank): 6 (2), 7 (2,3), 8 (0), 10 (0), 11 (0)
- Resolution: Replace memory modules
- Detected by: Cluster-wide memory diagnostic (memtest32)
- Downtime: 3 days (the whole cluster)
28Cautionary Note
- Using inexpensive, consumer-grade hardware potentially exposes you to low-quality components
- Never use components which have no built-in hardware monitoring and error detection capability
- Always configure your clusters to report corrected errors and out-of-range hardware sensor readings
- Act on the early warnings
- Otherwise, you run a risk of producing garbage science, and never knowing it
29Hardware Monitoring with Linux
Category      Parameter                                              Package
Motherboard   Temperature, power supply voltage, fan status          lm_sensors
Hard drives   Corrected error counts, impending failure indicators   ide-smart, S.M.A.R.T. Suite
Memory        Corrected error counts                                 ecc.o
Network       Hardware-dependent
http://www2.lm-sensors.nu/lm78/
http://www.linux-ide.org/smart.html
http://csl.cse.ucsc.edu/smart.shtml
http://www.anime.net/goemon/linux-ecc/ (2.2 kernels only)
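Corrected memory error counts are the early warning that would have flagged the failing DIMMs. The sketch below is an illustration, not the ecc.o interface listed above: it reads the EDAC sysfs counters of later Linux kernels, and assumes the node has ECC memory with an EDAC driver loaded. The function names are hypothetical.

```python
from pathlib import Path


def corrected_error_counts(edac_root="/sys/devices/system/edac/mc"):
    """Map each memory controller (mc0, mc1, ...) to its corrected-error count."""
    counts = {}
    for ce_file in sorted(Path(edac_root).glob("mc*/ce_count")):
        try:
            counts[ce_file.parent.name] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable counter: skip this controller
    return counts


def controllers_with_errors(counts):
    """Return the controllers that have logged at least one corrected error."""
    return [name for name, n in counts.items() if n > 0]


if __name__ == "__main__":
    counts = corrected_error_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for name in controllers_with_errors(counts):
        print(f"WARNING: {name} reports {counts[name]} corrected memory errors")
```

A nonzero count is not yet a failure, but a count that keeps growing is a reason to schedule a memory diagnostic before results are affected.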
30Summary and Conclusions
- Clusters are no longer a techno-geek's toy, and will remain the primary workhorse of many research groups, at least for a while
- Clusters give an impressive return on the investment, and may remain useful longer than expected
- Many (most?) useful research codes in molecular sciences are readily available on clusters
- Configuring and operating PC clusters can be tricky. Consider a reputable system integrator with Beowulf hardware and software experience