Title: Adaptive Grid-enabled SIMOX Simulation on Japan-US Grid Testbed
1. Adaptive Grid-enabled SIMOX Simulation on Japan-US Grid Testbed
- Yoshio Tanaka, Hiroshi Takemiya, Satoshi Sekiguchi (AIST, Japan)
- Shuji Ogata (Nagoya Institute of Technology, Japan)
- Rajiv K. Kalia, Aiichiro Nakano, Priya Vashishta (University of Southern California)
2. Hybrid QM/MD Simulation
- Enables large-scale simulation with quantum accuracy
- Combines classical MD simulation with QM simulation
- MD simulation
  - Simulates the behavior of atoms in the entire region
  - Based on classical MD using an empirical inter-atomic potential
- QM simulation
  - Modifies the energy calculated by the MD simulation, only in the regions of interest (sketched in the equation below)
  - Based on density functional theory (DFT)
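How the QM correction modifies the MD energy can be written compactly. The following is a minimal sketch of the standard additive (ONIOM-style) hybrid scheme that this description matches; the per-region sum over i is notation introduced here, not taken from the slides:

    E_{total} = E_{MD}^{system} + \sum_i ( E_{QM}^{(i)} - E_{MD}^{(i)} )

Here E_{MD}^{system} is the empirical-potential energy of all atoms, and for each QM region i the MD estimate E_{MD}^{(i)} is replaced by the DFT energy E_{QM}^{(i)}; forces follow by differentiating with respect to atomic positions.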
3. QM/MD simulation over the Pacific at SC2004
[Figure: Grid configuration of the SC2004 demonstration. An MD client drives, via Ninf-G, two 512-CPU partitions of the AIST P32 cluster, the AIST F32 cluster (256 CPUs), and the TCS cluster at PSC (512 CPUs); total number of CPUs: 1792. The simulated phenomenon is corrosion of silicon under stress, shown with a close-up view.]
4. Lessons Learned and Next Steps
- It is practically difficult to occupy a single large-scale system for a few weeks.
  - How can we run the simulation for a long time?
- Faults (e.g. HDD crashes, network outages) cannot be avoided.
  - We do not want manual restarts; the simulation should be capable of automatic recovery from faults.
  - How can the simulation recover from faults?
- Our latest adaptive QM/MD simulation allows the problem size of the embedded QM simulations to change automatically during the simulation.
  - This requires the number of processors / clusters to change dynamically.
5. Objectives
- Develop a flexible, robust, and efficient Grid-enabled simulation:
  - flexible -- allow dynamic resource allocation/migration,
  - robust -- detect errors and recover from faults automatically for long runs, and
  - efficient -- manage thousands of CPUs.
- Verify our strategy through large-scale experiments:
  - Implemented a Grid-enabled SIMOX (Separation by Implanted Oxygen) simulation
  - Ran the simulation on the Japan-US Grid testbed for a few weeks
6. Implementation using Ninf-G
- What is Ninf-G?
  - A reference implementation of the GridRPC API (a GGF proposed recommendation)
- Ninf-G includes
  - C/C++ and Java APIs and libraries for software development
  - an IDL compiler for stub generation (a schematic example follows this slide)
  - shell scripts to
    - compile the client program
    - build and publish remote libraries
  - sample programs and manual documents
- Ninf-G is developed using the Globus C and Java APIs
- Two major versions
  - Version 4 (Ninf-G4)
    - Works with GT4 WS GRAM as well as Pre-WS GRAM
    - Has an interface for working with other Grid middleware (e.g. UNICORE)
    - The latest version is 4.1.0 (included in NMI R9)
  - Version 2 (Ninf-G2)
    - Works with GT2, and with pre-WS GRAM in GT3 and GT4
    - The latest version is 2.4.3
    - Included in NMI Release 8 (the first non-U.S. software)
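For illustration, a schematic Ninf-G IDL file for a remote QM-force routine might look as follows. The module name, entry name, and argument shapes are hypothetical, and the keyword spelling is recalled from Ninf-G documentation examples rather than verified, so treat this as a sketch:

    Module qm_force;

    Define calc_qm_force (IN int n,
                          IN double positions[3*n],
                          OUT double forces[3*n])
    "Calculate DFT forces for the n atoms of one QM region"
    Required "libqm.o"
    Calls "C" calc_qm_force(n, positions, forces);

The IDL compiler turns a file like this into a remote executable stub plus the interface information that clients retrieve at handle-initialization time (see the architecture on the next slide).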
7. Architecture of Ninf-G
[Figure: Ninf-G architecture. Server side: the IDL compiler processes an IDL file describing the numerical library and generates the remote library executable together with an interface-information LDIF file, which is registered with GRIS/GIIS. Client side: grpc_function_handle_init() retrieves the interface information and starts the remote library executable through GRAM (jobmanager pbs/sge/lsf); grpc_call() then carries interface requests/replies and data between the client and the remote executable over Globus-IO.]
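To make the call flow concrete, here is a minimal GridRPC client sketch in C. The server host name and the module/entry name are hypothetical; the grpc_* functions are from the GGF GridRPC API that Ninf-G implements, so consult the Ninf-G manual for authoritative usage.

    #include <stdio.h>
    #include "grpc.h"

    int main(int argc, char *argv[])
    {
        grpc_function_handle_t handle;
        int n = 62;                          /* QM atoms (hypothetical size) */
        double pos[3 * 62] = {0.0}, force[3 * 62];

        /* Read the client configuration file named on the command line. */
        if (grpc_initialize(argv[1]) != GRPC_NO_ERROR)
            return 1;

        /* Bind the handle: Ninf-G retrieves the interface information
           and submits the remote executable via GRAM. */
        if (grpc_function_handle_init(&handle, "qm.example.org",
                                      "qm_force/calc_qm_force") != GRPC_NO_ERROR)
            return 1;

        /* Synchronous RPC: arguments are marshalled as declared in the IDL. */
        if (grpc_call(&handle, n, pos, force) != GRPC_NO_ERROR)
            fprintf(stderr, "grpc_call failed\n");

        grpc_function_handle_destruct(&handle);
        grpc_finalize();
        return 0;
    }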
8. Algorithm and Implementation
[Figure: flow of one hybrid QM/MD time step. After initial set-up, the MD part calculates the MD forces of the QM+MD regions and sends the data of the QM atoms to the QM part; there, the QM force of each QM region is calculated (the regions appear as parallel boxes in the original diagram) along with the MD forces of the QM regions; the QM forces are returned to the MD part, which updates the atomic positions and velocities, and the loop repeats.]
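In GridRPC terms, the parallel per-region QM boxes map naturally onto asynchronous calls. The following sketch assumes one pre-initialized handle per QM region and hypothetical MD helpers (md_forces(), update_atoms()); grpc_call_async() and grpc_wait_all() are standard GridRPC API functions.

    #include "grpc.h"

    #define NQM 5                    /* QM regions (5 initially) */

    extern grpc_function_handle_t handles[NQM]; /* one per QM cluster */
    extern int     nqm[NQM];         /* atoms in each QM region */
    extern double *qm_pos[NQM], *qm_force[NQM];
    extern void    md_forces(void);  /* classical MD, entire system */
    extern void    update_atoms(void);

    void time_step(void)
    {
        grpc_sessionid_t sid[NQM];
        int i;

        md_forces();                 /* MD forces of the QM+MD regions */

        /* One asynchronous RPC per QM region; the regions are
           computed concurrently on different clusters. */
        for (i = 0; i < NQM; i++)
            grpc_call_async(&handles[i], &sid[i],
                            nqm[i], qm_pos[i], qm_force[i]);

        grpc_wait_all();             /* block until all QM forces return */

        update_atoms();              /* apply QM-corrected forces */
    }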
9. SIMOX (Separation by Implanted Oxygen)
- A technique to fabricate a microstructure consisting of a Si surface on a thin SiO2 insulator
- Enables devices with higher speed and lower power consumption
  - This technology has advantages for portable products, such as laptops, hand-held devices, and other applications that depend on battery power.
- Further advancement of the SIMOX technology, toward fabricating ultra-fine-scale SOI structures, requires understanding the effects of the initial velocity and incident position of the implanted oxygen on the oxidation processes.
10. SIMOX simulation on the Grid
- Simulate SIMOX by implanting five oxygen atoms with initial velocities much smaller than the usual values.
  - The incident positions of the oxygen atoms relative to the surface crystalline structure of the Si differ.
- 5 QM regions are initially defined
  - The size and number of the QM regions change during the simulation
- 0.11 million atoms in total
- The results of the experiments will demonstrate the sensitivity of the process to the incident position of the oxygen atom when its implantation velocity is small.
11. Testbed for the experiment
- AIST super clusters
  - P32 (2144 CPUs), M64 (528 CPUs), F32 (536 CPUs)
- TeraGrid clusters
  - PSC clusters (3000 CPUs), NCSA clusters (1774 CPUs)
- USC clusters
  - USC (7280 CPUs)
- Japan clusters
  - U-Tokyo (386 CPUs), TITECH (512 CPUs)
12. Result of the experiment
- Experiment time: 18.97 days
- Simulation steps: 270 (= 54 fs, i.e. 0.2 fs per step)
- Longest continuous simulation: 4.76 days
13. Flexibility
- QM simulation regions were expanded/divided every 5 time steps
- The number of QM atoms gradually increased from 62 to 341
- The number of migrations of QM simulations was 244 (see the sketch after this list)
- The number of CPUs used for QM simulation increased from 10 to 708
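With GridRPC, this kind of intentional migration amounts to rebinding a region's function handle to a better-sized cluster. A minimal sketch, where pick_cluster_for() is a hypothetical scheduling helper and the module/entry name matches the earlier examples:

    #include <string.h>
    #include "grpc.h"

    extern char *pick_cluster_for(int natoms); /* hypothetical scheduler */

    /* Rebind the handle when a QM region has outgrown (or no longer
       needs) its current cluster: an intentional migration. */
    void migrate_if_needed(grpc_function_handle_t *h, int natoms,
                           char **current_server)
    {
        char *best = pick_cluster_for(natoms);

        if (strcmp(best, *current_server) != 0) {
            grpc_function_handle_destruct(h);
            grpc_function_handle_init(h, best, "qm_force/calc_qm_force");
            *current_server = best;
        }
    }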
14. Robustness
- Many kinds of errors occurred:
  - queues were not activated
  - MPI programs failed to start
  - quota limits were exceeded
  - ...
- Our application succeeded in detecting these errors and continuing the simulation on other clusters (a recovery sketch follows this slide)
[Figure: timeline of the clusters used during the run, annotated with intentional migrations, unintentional migrations, and points where a reservation finished.]
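Unintentional migration can be coded as error-triggered rebinding: if a call fails, drop the handle and retry the region on the next candidate cluster. The server list, retry limit, and names below are hypothetical; the grpc_* calls are from the GridRPC API.

    #include <stdio.h>
    #include "grpc.h"

    #define MAX_RETRY 3

    /* Candidate clusters for recovery (hypothetical host names). */
    static char *servers[] = { "p32.example", "tcs.example", "usc.example" };
    static const int nservers = 3;

    /* Call the remote QM routine; on failure, migrate to the next
       cluster and retry, up to MAX_RETRY attempts in total. */
    int qm_call_with_recovery(grpc_function_handle_t *h, int *srv,
                              int n, double *pos, double *force)
    {
        int attempt;

        for (attempt = 0; attempt < MAX_RETRY; attempt++) {
            if (grpc_call(h, n, pos, force) == GRPC_NO_ERROR)
                return 0;                     /* success */

            fprintf(stderr, "QM call failed; migrating\n");
            grpc_function_handle_destruct(h); /* drop the failed binding */
            *srv = (*srv + 1) % nservers;     /* unintentional migration */
            if (grpc_function_handle_init(h, servers[*srv],
                    "qm_force/calc_qm_force") != GRPC_NO_ERROR)
                return -1;
        }
        return -1;                            /* give up */
    }

As the next slide notes, choosing appropriate timeout values and retry counts for such recovery was itself nontrivial in the real experiment.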
15. Efficiency
- Communication time between QM and MD is negligible:
  - computation time of QM: about 1 hour
  - communication time between QM and MD: < 1 min
- Execution efficiency was limited to about 60%.
- Main causes:
  - load imbalance among the QM simulations
  - multiple assignment of QM regions to a single cluster
  - the cost of fault detection and recovery
    - it is not easy to find appropriate timeout values and numbers of retries
16. Summary
- We verified that our strategy for long runs is a practical approach:
  - continue the simulation by migrating from the current cluster to another, either intentionally or unintentionally.
- We verified that programming with GridRPC and MPI can implement a real Grid-enabled application:
  - dynamic resource allocation / migration
  - recovery from faults
  - management of hundreds of CPUs on distributed sites
17. Summary (cont'd)
- The remaining problem was heterogeneity
  - NOT in hardware or OS
  - heterogeneity exists in the finer details of system configuration:
    - the AGW at PSC
    - the strict firewall at USC
    - maximum wall-clock time limits for batch jobs
    - disk quota limits
  - Ninf-G could adapt to some of these issues, but not to the others
- We had to ask for special (manual) operations for our experiments, and we still encountered problems:
  - sites gave us a special (dedicated) queue
  - we needed help with unexpected errors (jobs were not activated)
  - easier procedures for cross-site reservation are expected
18. Acknowledgements
- Resource providers:
  - TeraGrid (especially the helpdesk admins at PSC and NCSA)
  - USC
  - TITECH and U. Tokyo
- This work at AIST was partially supported by JST (Japan Science and Technology Agency).
- This work at USC was partially supported by AFOSR-DURINT, ARL-MURI, DOE, and NSF.