Title: HPC At PNNL March 2004
1. HPC At PNNL, March 2004
R. Scott Studham, Associate Director, Advanced Computing
April 13, 2004
2. HPC Systems at PNNL
- Molecular Science Computing Facility
  - 11.8TF Linux-based supercomputer using Intel Itanium2 processors and an Elan4 interconnect
  - A balance for our users: 500TB disk, 6.8TB memory
- PNNL Advanced Computing Center
  - 128-processor SGI Altix
  - NNSA-ASC Spray Cool Cluster
3. William R. Wiley Environmental Molecular Sciences Laboratory
- Who are we?
  - A 200,000-square-foot U.S. Department of Energy national scientific user facility
  - Operated by Pacific Northwest National Laboratory in Richland, Washington
- What we provide for you
  - Free access to over 100 state-of-the-art research instruments
  - A peer-review proposal process
  - Expert staff to assist or collaborate
- Why use EMSL?
  - EMSL provides, under one roof, staff and instruments for fundamental research on physical, chemical, and biological processes.
4. HPCS2 Configuration
- 1,976 next-generation Itanium processors
- 928 compute nodes
- Elan4 and Elan3 interconnects
- Lustre over a 2Gb SAN with 53TB
- 2 system management nodes
- 4 login nodes with 4Gb-Enet
- 11.8TF peak, 6.8TB memory (see the peak-figure sketch after this list)
- The 11.8TF system is in full operation now.
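As a sanity check on the quoted peak, the sketch below works it out from the processor count. The 1.5 GHz clock and 4 floating-point operations per cycle per Itanium2 processor are assumptions of ours, not figures from the slide.

```python
# Back-of-the-envelope check of the HPCS2 peak figure.
# Assumptions (not stated on the slide): 1.5 GHz Itanium2 processors
# retiring 4 floating-point operations per cycle.

PROCESSORS = 1976          # from the slide
CLOCK_HZ = 1.5e9           # assumed Itanium2 clock
FLOPS_PER_CYCLE = 4        # assumed per-processor FP throughput

peak_per_cpu = CLOCK_HZ * FLOPS_PER_CYCLE   # 6.0 GFlop/s per processor
system_peak = PROCESSORS * peak_per_cpu     # ~11.9 TFlop/s, consistent with the quoted 11.8TF

print(f"per-CPU peak: {peak_per_cpu / 1e9:.1f} GFlop/s")
print(f"system peak:  {system_peak / 1e12:.1f} TFlop/s")
```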
5. Who uses the MSCF, and what do they run?
[Chart: usage breakdown by application code (e.g., Gaussian), FY02 numbers]
6. MSCF is focused on grand challenges
Fewer users, focused on longer, larger runs and big science.
More than 67% of the usage is for large jobs.
Demand for access to this resource is high.
7. World-class science is enabled by systems that deliver the fastest time-to-solution for our science
- Significant improvement (25-45% for moderate processor counts) in time to solution by upgrading the interconnect to Elan4.
  - Improved efficiency
  - Improved scalability
- HPCS2 is a science-driven computer architecture with the fastest time-to-solution for our users' science of any system we have benchmarked.
8. Accurate binding energies for large water clusters
- These results provide unique information on the transition from the cluster to the liquid and solid phases of water.
- Code: NWChem
- Kernel: MP2 (disk bound)
- Sustained performance: 0.6 Gflop/s per processor (10% of peak)
- Choke point: sustained 61GB/s of disk IO and used 400TB of scratch space (see the rough-numbers sketch after this list).
- Only took 5 hours on 1024 CPUs of the HP cluster. This is a capability-class problem that could not be completed on any other system.
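A rough reading of these figures, under the assumption (ours, not the slide's) that the 61GB/s disk rate held for the full 5-hour run:

```python
# Rough aggregate numbers for the MP2 water-cluster run, using only the
# figures quoted on the slide. Treating 61 GB/s as sustained for the
# whole 5 hours is our assumption.

cpus = 1024
sustained_per_cpu = 0.6e9      # 0.6 GFlop/s per processor (10% of peak)
run_hours = 5
disk_rate = 61e9               # bytes/s of aggregate disk IO

aggregate_flops = cpus * sustained_per_cpu          # ~0.6 TFlop/s sustained
total_io_bytes = disk_rate * run_hours * 3600       # ~1.1 PB of disk traffic

print(f"aggregate sustained: {aggregate_flops / 1e12:.2f} TFlop/s")
print(f"disk traffic if sustained: {total_io_bytes / 1e15:.1f} PB "
      f"(roughly {total_io_bytes / 400e12:.1f}x the 400TB scratch footprint)")
```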
9. Energy calculation of a protein complex
- The Ras-RasGAP protein complex is a key switch in the signaling network initiated by the epidermal growth factor (EGF). This signaling network controls cell death and differentiation, and mutations in the protein complex are responsible for 30% of all human tumors.
- Code: NWChem
- Kernel: Hartree-Fock
- Time to solution: 3 hours for one iteration on 1400 processors
- Computation of 107 residues of the full protein complex using approximately 15,000 basis functions. This is believed to be the largest calculation of its type.
10. Biogeochemistry: Membranes for Bioremediation
- HPCS1: Molecular dynamics of a lipopolysaccharide (LPS)
- HPCS2: Classical molecular dynamics of the LPS membrane of Pseudomonas aeruginosa and mineral
- HPCS3: Quantum mechanical/molecular mechanics molecular dynamics of membrane plus mineral
11. A new trend is emerging
[Chart: projected growth trend for biology storage, log scale!]
- With the expansion into biology, the need for storage has drastically increased.
- EMSL users have stored >50TB in the past 8 months. More than 80% of the data is from experimentalists.
12. Storage Drivers: We support three different domains with different requirements
- High Performance Computing - Chemistry
  - Low storage volumes (10 TB)
  - High-performance storage (>500MB/s per client, GB/s aggregate)
  - POSIX access
- High Throughput Proteomics - Biology
  - Large storage volumes (PBs), and exploding
  - Write once, read rarely if used as an archive
  - Modest latency okay (<10s to data)
  - If analysis could be done in place, it would require faster storage
- Atmospheric Radiation Measurement - Climate
  - Modest-sized storage requirements (100s of TB)
  - Shared with the community and replicated to ORNL
13. PNNL's Lustre Implementation
- PNNL and the ASCI Tri-Labs are currently working with CFS and HP to develop Lustre.
- Lustre has been in full production since last August and is used for aggressive IO from our supercomputer.
  - Highly stable
  - Still hard to manage
- We are expanding our use of Lustre to act as the filesystem for our archival storage.
  - Deploying a 400TB filesystem
- 660MB/s from a single client with a simple dd is faster than any local or global filesystem we have tested (see the sketch of such a test below).
- We are finally in the era where global filesystems provide faster access than local ones.
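For context on the 660MB/s figure, here is a minimal single-client streaming-write test in the spirit of the "simple dd" quoted above. The mount point and transfer sizes are placeholders, not the actual MSCF setup.

```python
# Minimal single-client streaming-write test, analogous in spirit to the
# "simple dd" mentioned on the slide. Path and sizes are placeholders.
import os
import time

TARGET = "/lustre/scratch/bw_test.dat"   # hypothetical Lustre mount point
BLOCK = 4 * 1024 * 1024                  # 4 MiB per write
BLOCKS = 2048                            # 8 GiB total

buf = os.urandom(BLOCK)
start = time.time()
with open(TARGET, "wb") as f:
    for _ in range(BLOCKS):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())                 # make sure the data really hit the servers
elapsed = time.time() - start

mb_written = BLOCK * BLOCKS / 1e6
print(f"wrote {mb_written:.0f} MB in {elapsed:.1f} s "
      f"-> {mb_written / elapsed:.0f} MB/s")
os.remove(TARGET)
```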
14. Security
- Open computing requires a trust relationship between sites.
  - A user logs into siteA and sshs to siteB. If siteA is compromised, the hacker has probably sniffed the password for siteB.
- Reaction 1: Teach users to minimize jumping through hosts they do not personally know are secure (why did the user trust siteA?)
- Reaction 2: Implement one-time passwords (SecureID); see the sketch after this list.
- Reaction 3: Turn off open access (Earth Simulator?)
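SecureID is a commercial, time-based token product; as a generic illustration of why one-time passwords defeat the password-sniffing scenario above, here is a minimal Lamport-style hash-chain sketch. This is not the SecureID algorithm, and the names in it are purely illustrative.

```python
# Lamport-style hash-chain one-time passwords: a generic illustration of
# why a sniffed password is useless for replay. NOT the SecureID algorithm.
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def make_chain(seed: bytes, n: int) -> list[bytes]:
    """Return [h^1(seed), ..., h^n(seed)]; the server stores only the last link."""
    chain, value = [], seed
    for _ in range(n):
        value = h(value)
        chain.append(value)
    return chain

# Setup: the user keeps the chain, the server keeps only the final link.
chain = make_chain(b"user secret seed", 100)
server_state = chain[-1]

def server_verify(candidate: bytes) -> bool:
    """Accept a password only if it hashes to the stored value, then roll back one link."""
    global server_state
    if h(candidate) == server_state:
        server_state = candidate        # the next login must use the previous link
        return True
    return False

print(server_verify(chain[-2]))   # True  -- first login
print(server_verify(chain[-2]))   # False -- a sniffed copy cannot be replayed
print(server_verify(chain[-3]))   # True  -- second login uses the next-earlier link
```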
15. Thoughts about one-time passwords
- A couple of different hurdles to cross
  - We would like to avoid forcing our users to carry a different SecureID card for each site they have access to.
  - However, the distributed nature of security (it is run by local site policy) will probably end up with something like this for the short term.
- As of April 8th, the MSCF has converted over to the PNNL SecureID system for all remote ssh logins.
  - Lots of FedExed SecureID cards
16. Summary
- HPCS2 is running well, and the IO capabilities of the system are enabling chemistry and biology calculations that could not be run on any other system in the world.
- Storage for proteomics is on a super-exponential growth trend.
- Lustre is great: 660MB/s from a single client, and we are building a 1/2PB single filesystem.
- We rapidly implemented SecureID authentication methods last week.