Title: Adaptive Grid-enabled SIMOX Simulation on Japan-US Grid Testbed
1. Adaptive Grid-enabled SIMOX Simulation on Japan-US Grid Testbed
- Yoshio Tanaka, Hiroshi Takemiya, Satoshi Sekiguchi (AIST, Japan)
- Shuji Ogata (Nagoya Institute of Technology, Japan)
- Rajiv K. Kalia, Aiichiro Nakano, Priya Vashishta (University of Southern California)
2. Hybrid QM/MD Simulation
- Enables large-scale simulation with quantum accuracy
- Combines classical MD simulation with QM simulation
- MD simulation
  - Simulates the behavior of atoms in the entire region
  - Based on classical MD using an empirical inter-atomic potential
- QM simulation
  - Modifies the energy calculated by the MD simulation, only in the regions of interest (sketched in the equation below)
  - Based on density functional theory (DFT)
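How the QM correction modifies the MD energy can be written compactly. The following is a minimal sketch of the standard additive (ONIOM-style) hybrid scheme that this description matches; the per-region sum over i is notation introduced here, not taken from the slides:

    E_{total} = E_{MD}^{system} + \sum_i ( E_{QM}^{(i)} - E_{MD}^{(i)} )

Here E_{MD}^{system} is the empirical-potential energy of all atoms, and for each QM region i the MD estimate E_{MD}^{(i)} is replaced by the DFT energy E_{QM}^{(i)}; forces follow by differentiating with respect to atomic positions.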
3. QM/MD simulation over the Pacific at SC2004
[Figure: Grid configuration of the SC2004 demonstration. An MD client drives, via Ninf-G, two 512-CPU partitions of the AIST P32 cluster, the AIST F32 cluster (256 CPUs), and the TCS cluster at PSC (512 CPUs); total number of CPUs: 1792. The simulated phenomenon is corrosion of silicon under stress, shown with a close-up view.]
4. Lessons Learned and Next Steps
- It is practically difficult to occupy a single large-scale system for a few weeks.
  - How can we run the simulation for a long time?
- Faults (e.g. HDD crashes, network outages) cannot be avoided.
  - We do not want manual restarts; the simulation should be capable of automatic recovery from faults.
  - How can the simulation recover from faults?
- Our latest adaptive QM/MD simulation allows the problem size of the embedded QM simulations to change automatically during the simulation.
  - This requires the number of processors / clusters to change dynamically.
5. Objectives
- Develop a flexible, robust, and efficient Grid-enabled simulation:
  - flexible -- allow dynamic resource allocation/migration,
  - robust -- detect errors and recover from faults automatically for long runs, and
  - efficient -- manage thousands of CPUs.
- Verify our strategy through large-scale experiments:
  - Implemented a Grid-enabled SIMOX (Separation by Implanted Oxygen) simulation
  - Ran the simulation on the Japan-US Grid testbed for a few weeks
6. Implementation using Ninf-G
- What is Ninf-G?
  - A reference implementation of the GridRPC API (a GGF proposed recommendation)
- Ninf-G includes
  - C/C++ and Java APIs and libraries for software development
  - an IDL compiler for stub generation (a schematic example follows this slide)
  - shell scripts to
    - compile the client program
    - build and publish remote libraries
  - sample programs and manual documents
- Ninf-G is developed using the Globus C and Java APIs
- Two major versions
  - Version 4 (Ninf-G4)
    - Works with GT4 WS GRAM as well as Pre-WS GRAM
    - Has an interface for working with other Grid middleware (e.g. UNICORE)
    - The latest version is 4.1.0 (included in NMI R9)
  - Version 2 (Ninf-G2)
    - Works with GT2, and with pre-WS GRAM in GT3 and GT4
    - The latest version is 2.4.3
    - Included in NMI Release 8 (the first non-U.S. software)
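For illustration, a schematic Ninf-G IDL file for a remote QM-force routine might look as follows. The module name, entry name, and argument shapes are hypothetical, and the keyword spelling is recalled from Ninf-G documentation examples rather than verified, so treat this as a sketch:

    Module qm_force;

    Define calc_qm_force (IN int n,
                          IN double positions[3*n],
                          OUT double forces[3*n])
    "Calculate DFT forces for the n atoms of one QM region"
    Required "libqm.o"
    Calls "C" calc_qm_force(n, positions, forces);

The IDL compiler turns a file like this into a remote executable stub plus the interface information that clients retrieve at handle-initialization time (see the architecture on the next slide).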
7. Architecture of Ninf-G
[Figure: Ninf-G architecture. Server side: the IDL compiler processes an IDL file describing the numerical library and generates the remote library executable together with an interface-information LDIF file, which is registered with GRIS/GIIS. Client side: grpc_function_handle_init() retrieves the interface information and starts the remote library executable through GRAM (jobmanager pbs/sge/lsf); grpc_call() then carries interface requests/replies and data between the client and the remote executable over Globus-IO.]
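To make the call flow concrete, here is a minimal GridRPC client sketch in C. The server host name and the module/entry name are hypothetical; the grpc_* functions are from the GGF GridRPC API that Ninf-G implements, so consult the Ninf-G manual for authoritative usage.

    #include <stdio.h>
    #include "grpc.h"

    int main(int argc, char *argv[])
    {
        grpc_function_handle_t handle;
        int n = 62;                          /* QM atoms (hypothetical size) */
        double pos[3 * 62] = {0.0}, force[3 * 62];

        /* Read the client configuration file named on the command line. */
        if (grpc_initialize(argv[1]) != GRPC_NO_ERROR)
            return 1;

        /* Bind the handle: Ninf-G retrieves the interface information
           and submits the remote executable via GRAM. */
        if (grpc_function_handle_init(&handle, "qm.example.org",
                                      "qm_force/calc_qm_force") != GRPC_NO_ERROR)
            return 1;

        /* Synchronous RPC: arguments are marshalled as declared in the IDL. */
        if (grpc_call(&handle, n, pos, force) != GRPC_NO_ERROR)
            fprintf(stderr, "grpc_call failed\n");

        grpc_function_handle_destruct(&handle);
        grpc_finalize();
        return 0;
    }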
8. Algorithm and Implementation
[Figure: flow of one hybrid QM/MD time step. After initial set-up, the MD part calculates the MD forces of the QM+MD regions and sends the data of the QM atoms to the QM part; there, the QM force of each QM region is calculated (the regions appear as parallel boxes in the original diagram) along with the MD forces of the QM regions; the QM forces are returned to the MD part, which updates the atomic positions and velocities, and the loop repeats.]
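In GridRPC terms, the parallel per-region QM boxes map naturally onto asynchronous calls. The following sketch assumes one pre-initialized handle per QM region and hypothetical MD helpers (md_forces(), update_atoms()); grpc_call_async() and grpc_wait_all() are standard GridRPC API functions.

    #include "grpc.h"

    #define NQM 5                    /* QM regions (5 initially) */

    extern grpc_function_handle_t handles[NQM]; /* one per QM cluster */
    extern int     nqm[NQM];         /* atoms in each QM region */
    extern double *qm_pos[NQM], *qm_force[NQM];
    extern void    md_forces(void);  /* classical MD, entire system */
    extern void    update_atoms(void);

    void time_step(void)
    {
        grpc_sessionid_t sid[NQM];
        int i;

        md_forces();                 /* MD forces of the QM+MD regions */

        /* One asynchronous RPC per QM region; the regions are
           computed concurrently on different clusters. */
        for (i = 0; i < NQM; i++)
            grpc_call_async(&handles[i], &sid[i],
                            nqm[i], qm_pos[i], qm_force[i]);

        grpc_wait_all();             /* block until all QM forces return */

        update_atoms();              /* apply QM-corrected forces */
    }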
9. SIMOX (Separation by Implanted Oxygen)
- A technique to fabricate a microstructure consisting of a Si surface on a thin SiO2 insulator
- Enables devices with higher speed and lower power consumption
  - This technology has advantages for portable products, such as laptops, hand-held devices, and other applications that depend on battery power.
- Further advancement of the SIMOX technology, toward fabricating ultra-fine-scale SOI structures, requires understanding the effects of the initial velocity and incident position of the implanted oxygen on the oxidation processes.
10. SIMOX simulation on the Grid
- Simulate SIMOX by implanting five oxygen atoms with initial velocities much smaller than the usual values.
  - The incident positions of the oxygen atoms relative to the surface crystalline structure of the Si differ.
- 5 QM regions are initially defined
  - The size and number of the QM regions change during the simulation
- 0.11 million atoms in total
- The results of the experiments will demonstrate the sensitivity of the process to the incident position of the oxygen atom when its implantation velocity is small.
11. Testbed for the experiment
- AIST super clusters
  - P32 (2144 CPUs), M64 (528 CPUs), F32 (536 CPUs)
- TeraGrid clusters
  - PSC clusters (3000 CPUs), NCSA clusters (1774 CPUs)
- USC clusters
  - USC (7280 CPUs)
- Japan clusters
  - U-Tokyo (386 CPUs), TITECH (512 CPUs)
12. Result of the experiment
- Experiment time: 18.97 days
- Simulation steps: 270 (= 54 fs, i.e. 0.2 fs per step)
- Longest continuous simulation: 4.76 days
13. Flexibility
- QM simulation regions were expanded/divided every 5 time steps
- The number of QM atoms gradually increased from 62 to 341
- The number of migrations of QM simulations was 244 (see the sketch after this list)
- The number of CPUs used for QM simulation increased from 10 to 708
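With GridRPC, this kind of intentional migration amounts to rebinding a region's function handle to a better-sized cluster. A minimal sketch, where pick_cluster_for() is a hypothetical scheduling helper and the module/entry name matches the earlier examples:

    #include <string.h>
    #include "grpc.h"

    extern char *pick_cluster_for(int natoms); /* hypothetical scheduler */

    /* Rebind the handle when a QM region has outgrown (or no longer
       needs) its current cluster: an intentional migration. */
    void migrate_if_needed(grpc_function_handle_t *h, int natoms,
                           char **current_server)
    {
        char *best = pick_cluster_for(natoms);

        if (strcmp(best, *current_server) != 0) {
            grpc_function_handle_destruct(h);
            grpc_function_handle_init(h, best, "qm_force/calc_qm_force");
            *current_server = best;
        }
    }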
14. Robustness
- Many kinds of errors occurred:
  - queues were not activated
  - MPI programs failed to start
  - quota limits were exceeded
  - ...
- Our application succeeded in detecting these errors and continuing the simulation on other clusters (a recovery sketch follows this slide)
[Figure: timeline of the clusters used during the run, annotated with intentional migrations, unintentional migrations, and points where a reservation finished.]
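Unintentional migration can be coded as error-triggered rebinding: if a call fails, drop the handle and retry the region on the next candidate cluster. The server list, retry limit, and names below are hypothetical; the grpc_* calls are from the GridRPC API.

    #include <stdio.h>
    #include "grpc.h"

    #define MAX_RETRY 3

    /* Candidate clusters for recovery (hypothetical host names). */
    static char *servers[] = { "p32.example", "tcs.example", "usc.example" };
    static const int nservers = 3;

    /* Call the remote QM routine; on failure, migrate to the next
       cluster and retry, up to MAX_RETRY attempts in total. */
    int qm_call_with_recovery(grpc_function_handle_t *h, int *srv,
                              int n, double *pos, double *force)
    {
        int attempt;

        for (attempt = 0; attempt < MAX_RETRY; attempt++) {
            if (grpc_call(h, n, pos, force) == GRPC_NO_ERROR)
                return 0;                     /* success */

            fprintf(stderr, "QM call failed; migrating\n");
            grpc_function_handle_destruct(h); /* drop the failed binding */
            *srv = (*srv + 1) % nservers;     /* unintentional migration */
            if (grpc_function_handle_init(h, servers[*srv],
                    "qm_force/calc_qm_force") != GRPC_NO_ERROR)
                return -1;
        }
        return -1;                            /* give up */
    }

As the next slide notes, choosing appropriate timeout values and retry counts for such recovery was itself nontrivial in the real experiment.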
15. Efficiency
- Communication time between QM and MD is negligible:
  - computation time of QM: about 1 hour
  - communication time between QM and MD: < 1 min
- Execution efficiency was limited to about 60%.
- Main causes:
  - load imbalance among the QM simulations
  - multiple assignment of QM regions to a single cluster
  - the cost of fault detection and recovery
    - it is not easy to find appropriate timeout values and numbers of retries
16. Summary
- We verified that our strategy for long runs is a practical approach:
  - continue the simulation by migrating from the current cluster to another, either intentionally or unintentionally.
- We verified that programming with GridRPC and MPI can implement a real Grid-enabled application:
  - dynamic resource allocation / migration
  - recovery from faults
  - management of hundreds of CPUs on distributed sites
17. Summary (cont'd)
- The remaining problem was heterogeneity
  - NOT in hardware or OS
  - heterogeneity exists in the finer details of system configuration:
    - the AGW at PSC
    - the strict firewall at USC
    - maximum wall-clock time limits for batch jobs
    - disk quota limits
  - Ninf-G could adapt to some of these issues, but not to the others
- We had to ask for special (manual) operations for our experiments, and we still encountered problems:
  - sites gave us a special (dedicated) queue
  - we needed help with unexpected errors (jobs were not activated)
  - easier procedures for cross-site reservation are expected
18. Acknowledgements
- Resource providers:
  - TeraGrid (especially the helpdesk admins at PSC and NCSA)
  - USC
  - TITECH and U. Tokyo
- This work at AIST was partially supported by JST (Japan Science and Technology Agency).
- This work at USC was partially supported by AFOSR-DURINT, ARL-MURI, DOE, and NSF.