Chemical Supercomputing on the Cheap
- A 94 GFlops computer system at CDN$3,680 per gigaflop
S. Patchkovskii, R. Schmid, and T. Ziegler
Department of Chemistry, University of Calgary,
2500 University Dr. NW, Calgary, Alberta, T2N 1N4
Canada
Introduction
Accurate quantum-chemical modeling of systems of chemical interest is extremely computationally intensive and requires substantial amounts of memory and secondary storage. This has traditionally consigned first-principles calculations of chemical properties to large (and expensive) vector and parallel computers, placing them out of reach of most practicing chemists. With the ever-increasing computational power of low-end workstations and commodity PCs, it is now possible to perform useful quantum-chemical calculations on inexpensive off-the-shelf hardware. Widely available and robust local area network (LAN) technologies, such as switched 100Mbit/second Ethernet, can be used to combine multiple workstations into a larger parallel system, providing supercomputer levels of performance at a favorable price/performance ratio. In this poster, we describe the COBALT cluster (Computers On Benches All Linked Together), a chemically oriented supercomputer built in our research group at the University of Calgary.
Cobalt hardware: Nodes
A node of the Cobalt cluster is a Compaq/Digital Personal Workstation model 500au. Each workstation is configured with:
For comparison, a top-of-the-line 550MHz Intel Xeon workstation with 512KB of L2 cache achieves 24.4 SpecInt95 and 17.1 SpecFp95 and costs about CDN$4,400 from Dell (May 1999).
(SpecInt95 and SpecFp95 values for the 500au node are estimated from published results for a 500au system with a 2MB L3 cache.)
Cobalt hardware: Network
Cobalt nodes communicate through a dedicated 96-port full-duplex 100BaseTX Ethernet switch, constructed from four 24-port 3COM SuperStack II 3300 switches linked by a matrix module. Latency and bandwidth were measured with Larry McVoy's lmbench suite using otherwise idle nodes.
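A quick sanity check of the round-trip latency can be performed without installing lmbench. The sketch below is a minimal TCP ping-pong microbenchmark in the spirit of lmbench's lat_tcp; the port number and round count are arbitrary assumptions, and Python overhead makes the absolute figures pessimistic compared to lmbench's C implementation.

```python
# Minimal TCP round-trip latency probe (a sketch, not lmbench itself).
# Run server() on one node and client("<nodename>") on another.
import socket
import time

PORT = 31337      # arbitrary unprivileged port (assumption)
NROUNDS = 1000    # number of one-byte round trips to average over

def server():
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            for _ in range(NROUNDS):
                conn.sendall(conn.recv(1))   # echo each byte straight back

def client(host):
    with socket.create_connection((host, PORT)) as s:
        # Disable Nagle's algorithm so each byte is sent immediately.
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        t0 = time.perf_counter()
        for _ in range(NROUNDS):
            s.sendall(b"x")
            s.recv(1)                        # wait for the echo
        dt = time.perf_counter() - t0
        print(f"mean round-trip latency: {dt / NROUNDS * 1e6:.1f} us")
```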
Cobalt hardware: The Cluster
RAID, assembly, and miscellaneous costs: CDN$6,500
Cobalt system software
Cobalt single system image
A single system image (SSI), or the ability of a group of computers to present the illusion of a single large computer system, is considered the defining characteristic of clusters. In order to have a usability advantage over a pile of individual computers, a cluster must provide its users with an SSI covering most of the users' problem areas. Cobalt nodes present the illusion of a single computer in several important respects, namely:
Cobalt application software
Cobalt total cost
The complete per-node construction price, including all hardware and software, is thus substantially lower than the retail price of a comparably equipped PC.
Running ADF in parallel on Cobalt
ADF has been parallelized at the Vrije Universiteit in Amsterdam and can use either the MPI or PVM message-passing libraries. Only the computationally intensive parts of the program (numerical integration and density fitting) have been parallelized. All relatively inexpensive parts of the calculation are repeated on every participating node, greatly reducing the amount of data that has to be communicated over the network. In a typical ADF run, the nodes have to synchronize only once per SCF cycle or gradient calculation.
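This pattern (replicate the cheap work on every node, divide only the expensive loops, and meet at a single global reduction) can be sketched in a few lines. The toy below uses mpi4py and a trivial numerical integration purely as an illustration of the communication structure; it is not ADF's actual code.

```python
# Toy illustration of the ADF-style pattern: cheap setup replicated on all
# nodes, expensive loop divided between nodes, one global reduction.
# Run with, e.g.: mpirun -np 4 python adf_pattern.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# "Cheap" setup: simply repeated on every node, no communication needed.
npoints = 1_000_000
h = 1.0 / npoints

# "Expensive" part (a stand-in for numerical integration): each node
# handles a strided subset of the grid points.
local = 0.0
for i in range(rank, npoints, size):
    x = (i + 0.5) * h
    local += 4.0 / (1.0 + x * x) * h   # integrates to pi on [0, 1]

# The single synchronization point, analogous to once per SCF cycle.
total = comm.allreduce(local, op=MPI.SUM)
if rank == 0:
    print(f"integral = {total:.6f} using {size} nodes")
```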
We illustrate the parallel performance of ADF with a full geometry optimization of nitridoporphyrinatochromium(V), a medium-sized molecule with 38 atoms shown on the left. This calculation used a polarized triple-ζ basis set on all atoms, resulting in 580 basis functions. The molecule was constrained to C4v symmetry. For this system, a serial calculation takes 683 minutes on a single Cobalt node (using 45MB of memory and about 100MB of disk space). For the parallel runs, the execution time on n nodes can be approximated by Amdahl's law,
T(n) = Tserial + Tparallel/n + Toverhead
where Tserial is the inherently serial part of the calculation (21 minutes), Tparallel is the parallelizable part of the calculation (662 minutes), and Toverhead is the parallel overhead (103 minutes).
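Plugging the fitted values into this expression gives a quick feel for the scaling. A minimal calculation, assuming the overhead term is constant and applies only to parallel runs (consistent with the quoted 683-minute serial time):

```python
# Predicted ADF wall-clock time from the Amdahl decomposition above.
# Assumes a constant overhead term that applies only to parallel runs.
T_SERIAL, T_PARALLEL, T_OVERHEAD = 21.0, 662.0, 103.0   # minutes

def t_wall(n):
    overhead = T_OVERHEAD if n > 1 else 0.0
    return T_SERIAL + T_PARALLEL / n + overhead

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} nodes: {t_wall(n):6.1f} min, "
          f"speedup {t_wall(1) / t_wall(n):.2f}x")
```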
Running PAW in parallel on Cobalt
As a parallel application, PAW is the exact opposite of ADF. Computationally, it is dominated by fast Fourier transforms (FFTs), which place a heavy demand on both inter-node bandwidth and round-trip latency. When running on n nodes, the parallel FFT algorithm used in PAW needs to exchange all Fourier coefficients held on each node (which can easily require several hundred megabytes of storage) n times during each molecular dynamics (MD) step (see below), resulting in heavy communications traffic.
In a typical parallel PAW run on Cobalt, the full-duplex 100Mbit/second communication links between the nodes and the central switch continuously run at over 20% utilization (more than 2.5MB/second) in each direction. In a sense, the Cobalt nodes and the communication network are perfectly matched for PAW runs: faster CPUs would have made the communication network choke on the data, while a slower communication network would have been unable to keep the CPUs busy.
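A rough plausibility check on these figures can be made with the helper below, which estimates the time per MD step spent in the coefficient exchange. Every input is an illustrative assumption: the coefficient volume per node, the usable per-link bandwidth (taken as the 2.5MB/second sustained figure quoted above), and the simplification that each node's coefficients cross its link once per exchange.

```python
# Back-of-the-envelope estimate of per-MD-step communication time for the
# parallel FFT exchange described above. All inputs are illustrative
# assumptions, not measured PAW quantities.
def fft_exchange_seconds(coeff_mb_per_node, n_nodes, link_mb_per_s=2.5):
    # Each node's share of the Fourier coefficients crosses its link once
    # per exchange, and the exchange happens n_nodes times per MD step.
    return n_nodes * coeff_mb_per_node / link_mb_per_s

# E.g. 40MB of coefficients per node on 4 nodes (hypothetical numbers):
print(f"{fft_exchange_seconds(40.0, 4):.0f} s per MD step in communication")
```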
To illustrate the performance of parallel PAW on Cobalt, consider an SN2 substitution reaction between CH3I and Rh(CO)2I2-. This medium-size simulation was performed in an 11Å periodic cell. In a serial run, a single time step requires about 83 seconds; a complete simulation consists of several thousand steps. Fitting the measured execution times at different node counts to Amdahl's law gives (all times in seconds)
Unlike the ADF case, the inherently serial part here constitutes less than 3 percent of the total work, but PAW spends almost 10 percent of the total time in the parallel overhead. As a consequence, PAW cannot efficiently utilize more than four Cobalt nodes for this simulation.
Molecular dynamics calculations in PAW are frequently limited by the amount of memory required to perform the calculation rather than by the simulation time. In parallel mode, PAW can significantly reduce its per-node memory requirements by distributing both the real-space and Fourier-space grids between the nodes. Since the size of the grids grows with the unit cell size R as O(R^3), they dominate PAW's memory requirements for all but the smallest systems. For the CH3I and Rh(CO)2I2- system, memory requirements in serial mode are relatively modest at 231 megabytes. In the parallel regime, the per-node memory requirement on n nodes is given by
M(n) = Mprivate + Mdistributed/n + Moverhead
where Mprivate is the amount of memory holding data private to a given node (7MB), Mdistributed is the amount of memory distributed between the nodes (224MB), and Moverhead is the parallel overhead (9MB). Running this job on six nodes thus reduces the per-node memory requirement to just 53MB. Parallel PAW was used to run jobs requiring almost 3GB of memory on Cobalt, even though no Cobalt node has more than 512MB of memory installed.
[Figure: measured vs. ideal per-node memory usage as a function of the number of nodes]
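The decomposition above, with the quoted constants, reproduces the six-node figure and also explains how an almost 3GB job fits on 512MB nodes. A minimal check (the 3GB extrapolation assumes the same small private and overhead terms, which is an assumption):

```python
# Per-node memory predicted by the decomposition above (all sizes in MB).
M_PRIVATE, M_DISTRIBUTED, M_OVERHEAD = 7.0, 224.0, 9.0

def mem_per_node(n, distributed=M_DISTRIBUTED):
    return M_PRIVATE + distributed / n + M_OVERHEAD

print(f"6 nodes: {mem_per_node(6):.0f} MB")   # ~53 MB, as quoted above

# Hypothetical ~3 GB job: smallest node count that fits under 512 MB/node.
n = 1
while mem_per_node(n, distributed=3000.0) > 512.0:
    n += 1
print(f"a 3 GB job fits on {n} nodes")
```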
Summary
We have described the construction of the Cobalt cluster, a uniquely powerful and inexpensive dedicated computational chemistry resource. With a per-node construction cost typical of high-end PCs, Cobalt provides supercomputer levels of performance on several quantum-chemical applications. Multiple nodes can be utilized in parallel, resulting in increased throughput and reduced wall-clock execution time. Tens of nodes can be utilized efficiently for a single large DFT calculation using ADF. For further information on Cobalt hardware and software, visit the Cobalt home page at http://www.cobalt.chem.ucalgary.ca
Credits
- Financial support for the construction of the Cobalt cluster was provided by:
  - Canada Foundation for Innovation (CFI)
  - Alberta Intellectual Infrastructure Partnership program (AIIP)
  - Department of Chemistry of the University of Calgary
  - Scientific Chemistry Simulations Inc., Netherlands
  - Mitsui Chemicals
  - Nova Chemicals
References and further reading
- SpecFp95 and SpecInt95 benchmark results are available on the web site of the Standard Performance Evaluation Corp. (SPEC) at http://www.specbench.org
- Prices and system specifications of Dell workstations were taken from the Dell Canada web site at http://www.dell.ca
- Technical specifications of the 3COM fast Ethernet switches are available on the 3COM web site at http://www.3com.com
- Larry McVoy's lmbench microbenchmark suite was downloaded from the Bitmover web site at http://www.bitmover.com/lmbench/
- Greg Pfister's In Search of Clusters, 2nd edition (Prentice Hall, 1998), is the definitive guide to clusters
- Additional information on the Amsterdam Density Functional code is available on the web site of Scientific Computing and Modeling at http://www.scm.com
- Additional information on the PAW first-principles MD code is available on the Cobalt web site at http://www.cobalt.chem.ucalgary.ca/paw/
- See the Gaussian Inc. web site at http://www.gaussian.com/