Title: Latency Tolerance Through Parallelization of Time in Scientific Applications
1. Latency Tolerance Through Parallelization of Time in Scientific Applications
- Ashok Srinivasan, Computer Science, Florida State University
- Namas Chandra, Mechanical Engineering, Florida State University
- Aim: Long time scales on small physical systems
- Solution features: Time parallelization to avoid fine granularity
- www.cs.fsu.edu/asriniva
2. Outline
- Application
- Time parallelization
- Prediction of a Carbon nanotube state
- Experimental results
- Conclusions and future work
3. Applications
- Small physical systems for long time scales
- Class of applications considered
- $\text{State}(T_i) = F(\text{State}(T_{i-1}))$
- Inherently sequential
- Example
- Molecular dynamics simulations of Carbon nanotubes
- Time step size: about $10^{-15}$ second
- After a million steps, we are still only in the nanosecond range
- Even that requires about a day of sequential computing time for around 3000 atoms
- Spatial parallelization will lead to too fine a granularity
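The time-scale arithmetic and the sequential dependence above can be checked with a minimal sketch. The step size and step count are taken from the slide; the function names `run_sequential` and `simulated_time` are hypothetical and not part of the original work.

```python
from typing import Callable, TypeVar

S = TypeVar("S")


def run_sequential(state: S, step: Callable[[S], S], num_steps: int) -> S:
    """Advance a state one step at a time: State(T_i) = F(State(T_{i-1}))."""
    for _ in range(num_steps):
        # Step i cannot start before step i-1 has finished.
        state = step(state)
    return state


def simulated_time(num_steps: int, step_size_s: float) -> float:
    """Physical time (in seconds) covered by num_steps steps of step_size_s each."""
    return num_steps * step_size_s


# One million steps of about 1e-15 s each cover only about one nanosecond.
nanotube_time = simulated_time(1_000_000, 1e-15)
print(f"{nanotube_time:.1e} s")

# Toy stand-in for F: a counter incremented once per step.
counter = run_sequential(0, lambda s: s + 1, 5)
```

The loop makes the bottleneck visible: no amount of spatial parallelism removes the dependence of each step on the previous one.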
4. CNT application
- Pull the CNT at a constant velocity
- Performed to determine material properties
- Material response can be used by an FEM simulation in a multiscale model
5. Time parallelization
- Based on a predict-verify approach
- Use results of old simulations to speed up the current simulation
- Relationships between different problem parameters often occur in engineering
- Example: temperature and time, stress and time
- Find a relationship and use it to predict the state at different times
- The relationship is determined automatically, and updated dynamically
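One round of the predict-verify idea can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the state is a single number, the P "processors" run serially, and `simulate`, `predict`, and `equivalent` are hypothetical callables supplied by the caller.

```python
from typing import Callable, Tuple


def time_parallel_round(
    start: float,
    P: int,
    simulate: Callable[[float], float],
    predict: Callable[[float, int], float],
    equivalent: Callable[[float, float], bool],
) -> Tuple[float, int]:
    """Run one predict-verify round over P time intervals.

    Returns the last verified state and the number of intervals accepted.
    """
    # Interval 0 starts from a verified state; later starts are predictions.
    starts = [start] + [predict(start, i) for i in range(1, P)]
    # Each "processor" runs the exact simulation over its own interval.
    ends = [simulate(s) for s in starts]
    # Verify in order: the exact end of interval i-1 must match the
    # predicted start of interval i, otherwise later work is rejected.
    accepted = 1
    while accepted < P and equivalent(ends[accepted - 1], starts[accepted]):
        accepted += 1
    return ends[accepted - 1], accepted


# Toy problem: each interval adds 1.0; the predictor is wrong for interval 3.
result = time_parallel_round(
    start=0.0,
    P=4,
    simulate=lambda s: s + 1.0,
    predict=lambda s0, i: s0 + i if i < 3 else s0 + i + 0.5,
    equivalent=lambda a, b: abs(a - b) < 0.1,
)
print(result)
```

In this toy run three of the four intervals are accepted, so the round advances three intervals in the time of one.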
6. Guided simulations
- Notation
- $r$ = exact time / parallel overhead
- $P$ = number of processors
- $\alpha$ = progress rate
- Speedup
- $P\alpha / (1 + 1/r)$
- $\approx P\alpha$
- If prediction and communication overheads are relatively small
- P Time steps
- $\alpha \in [1/P, 1]$
- Requires all-reduce and broadcast
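The speedup expression can be explored numerically with a short sketch. It assumes the slide's formula reads P·α/(1 + 1/r), with α the progress rate in [1/P, 1] and r the ratio of exact computation time to parallel overhead; the function name `speedup` is hypothetical.

```python
def speedup(P: int, alpha: float, r: float) -> float:
    """Speedup P * alpha / (1 + 1/r) for P processors.

    alpha is the progress rate (fraction of the P intervals accepted per round)
    and r is exact computation time divided by parallel overhead.
    """
    if P < 1 or r <= 0:
        raise ValueError("P must be >= 1 and r must be positive")
    if not (1.0 / P <= alpha <= 1.0):
        raise ValueError("alpha must lie in [1/P, 1]")
    return P * alpha / (1.0 + 1.0 / r)


# With negligible overhead (large r) the speedup approaches P * alpha.
print(speedup(50, 0.8, 1e6))
# With overhead equal to computation (r = 1) the speedup is halved.
print(speedup(50, 0.8, 1.0))
# Worst case alpha = 1/P: only one interval is accepted per round.
print(speedup(50, 1 / 50, 1e6))
```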
7. Fault tolerance too
- In case of node failure, another processor fills in the missing time interval
- Other computations need not be discarded
- Efficiency close to 1
- For large P
- Excluding loss in efficiency from errors
- If communication cost is negligible
- A master-worker design may be useful sometimes
[Figure: master-worker diagram with a master, time intervals t1 to t4, and processors P1 to P4]
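The reassignment idea can be sketched in a few lines. This is a minimal illustration, not the authors' design: the interval-to-processor mapping below is hypothetical (the figure's actual mapping is not recoverable), and a real master would also track completed results.

```python
from typing import Dict, Set


def reassign_after_failure(assignment: Dict[str, str], failed: Set[str]) -> Dict[str, str]:
    """Return a new interval -> worker map after some workers fail.

    Intervals held by failed workers are handed to surviving workers in
    round-robin order; every other assignment is left untouched, so work
    already done on those intervals need not be discarded.
    """
    survivors = sorted({w for w in assignment.values() if w not in failed})
    if not survivors:
        raise RuntimeError("no surviving workers")
    new_assignment = {}
    next_survivor = 0
    for interval, worker in assignment.items():
        if worker in failed:
            new_assignment[interval] = survivors[next_survivor % len(survivors)]
            next_survivor += 1
        else:
            new_assignment[interval] = worker
    return new_assignment


# Hypothetical mapping: one time interval per processor, then P2 fails.
before = {"t1": "P1", "t2": "P2", "t3": "P3", "t4": "P4"}
after = reassign_after_failure(before, failed={"P2"})
print(after)
```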
8. Fault tolerance
[Figure: master-worker diagram with a master, time intervals t2, t5, and t6, and processors P1 to P4]
9. Requirements for this approach
- Method for predicting a state
- Criterion for determining whether two states (predicted and actual) are similar
- Choice of suitable base (old) simulation
10. Prediction of a Carbon nanotube state
- Definition of equivalence of two states
- Atoms vibrate around their mean position
- Consider states equivalent if differences in position, potential energy, and temperature are within the normal range of fluctuations
- Max displacement 0.211
- Mean displacement 0.0789
- Potential energy fluctuation 0.35
- Temperature fluctuation 12.5 K
[Figure: displacement (from mean) of an atom about its mean position]
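The equivalence criterion could be coded roughly as below. The threshold values come from the slide, but their units are not stated there, so this sketch assumes they are in the same units as the inputs and that each is compared against an absolute difference; `states_equivalent` and its argument names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Thresholds from the slide; units are an assumption (same units as the inputs).
MAX_DISPLACEMENT = 0.211
MEAN_DISPLACEMENT = 0.0789
POTENTIAL_ENERGY_FLUCTUATION = 0.35
TEMPERATURE_FLUCTUATION_K = 12.5


def states_equivalent(
    pos_pred: Sequence[Vec3],
    pos_exact: Sequence[Vec3],
    pe_pred: float,
    pe_exact: float,
    temp_pred: float,
    temp_exact: float,
) -> bool:
    """True if predicted and exact states differ by no more than normal fluctuations."""
    # Per-atom distance between predicted and exact positions.
    disps = [math.dist(p, q) for p, q in zip(pos_pred, pos_exact)]
    return (
        max(disps) <= MAX_DISPLACEMENT
        and sum(disps) / len(disps) <= MEAN_DISPLACEMENT
        and abs(pe_pred - pe_exact) <= POTENTIAL_ENERGY_FLUCTUATION
        and abs(temp_pred - temp_exact) <= TEMPERATURE_FLUCTUATION_K
    )


exact = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
close = [(0.05, 0.0, 0.0), (1.0, 0.05, 0.0)]
far = [(0.3, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(states_equivalent(close, exact, 10.1, 10.0, 305.0, 300.0))
print(states_equivalent(far, exact, 10.1, 10.0, 305.0, 300.0))
```

All four tests must pass for a prediction to be accepted; failing any one of them rejects the interval.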
11. Prediction
- Predictor
- Independently predict change in each coordinate
- Normalize coordinates to be in $[0, 1]$
- $x_{t+\Delta t} = x_t + \dot{x}_{t+\Delta t}\,\Delta t$
- $\dot{x}_{t+\Delta t}$ is the rate of change of $x$ in this time interval
- It is unknown and needs to be estimated
12. Predict change in coordinates
- Express $\dot{x}$ in terms of basis functions
- Example
- $\dot{x}_{t+\Delta t} = a_{0,t+\Delta t} + a_{1,t+\Delta t}\,x_t$
- $a_{0,t+\Delta t}$, $a_{1,t+\Delta t}$ are unknown
- Express changes, $y$, for the base (old) simulation similarly, in terms of coefficients $b$, and perform least squares fit
- Predict $a_{i,t+\Delta t}$ as $b_{i,t+\Delta t} + R_{t+\Delta t}$
- $R_{t+\Delta t} = (1-\beta)\,R_t + \beta\,(a_{i,t} - b_{i,t})$
- Intuitively, the difference between the base coefficient and the current coefficient is predicted as a weighted combination of previous weights
- We use $\beta = 0.5$
- Gives more weight to latest results
- Does not let random fluctuations affect the predictor too much
- Velocity estimated as latest accurate results known
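The predictor steps can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: it uses the linear basis from the example (rate of change ≈ c0 + c1·x), keeps one correction term R per coefficient, and reads the update rule as R_new = (1 − β)·R_old + β·(a − b) with β = 0.5. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit ys ~ c0 + c1 * xs; returns (c0, c1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    return mean_y - c1 * mean_x, c1


def update_correction(R_prev: Sequence[float], a_prev: Sequence[float],
                      b_prev: Sequence[float], beta: float = 0.5) -> List[float]:
    """Smooth the gap between current (a) and base (b) coefficients."""
    return [(1 - beta) * r + beta * (a - b) for r, a, b in zip(R_prev, a_prev, b_prev)]


def predict_coefficients(b_next: Sequence[float], R_next: Sequence[float]) -> List[float]:
    """Predicted current coefficients: base coefficients plus the correction."""
    return [b + r for b, r in zip(b_next, R_next)]


def predict_positions(xs: Sequence[float], a: Sequence[float], dt: float) -> List[float]:
    """x_{t+dt} = x_t + (a0 + a1 * x_t) * dt for normalized coordinates."""
    return [x + (a[0] + a[1] * x) * dt for x in xs]


# Fit base-simulation changes (toy data lying on y = 1 + 2x).
b_coeffs = fit_line([0.0, 0.5, 1.0], [1.0, 2.0, 3.0])
# Update the correction from the last interval where both a and b are known.
R = update_correction([0.0, 0.0], a_prev=[1.0, 2.0], b_prev=[0.5, 1.0])
a_pred = predict_coefficients([0.5, 1.0], R)
x_pred = predict_positions([0.0, 1.0], a_pred, dt=0.1)
print(b_coeffs, R, a_pred, x_pred)
```

With β = 0.5 the latest observed gap and the accumulated history contribute equally, which matches the slide's stated balance between responsiveness and damping of random fluctuations.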
13. Experimental results
- Experimental parameters
- Carbon nanotube with 1000 atoms
- Around 200 atoms in the beginning fixed
- Around 200 atoms at the end moved deterministically
- Time step size 0.5 femtoseconds
- Time interval per processor 1000 time steps
- Tersoff-Brenner potential for MD
- Temperature: 300 K current, 10 K base
- $\beta = 0.5$
- Base simulation $v$ = 0.05 Å/1000 time steps
- Actual simulation $v$ = 0.0625 Å/1000 time steps
- A parallel run was simulated
14. Errors on 50 processors
[Figure: difference between predictor and verifier, with the threshold for accepting the results]
15. Errors on 50 processors
[Figure: difference between predictor and verifier, with the threshold for accepting the results]
16. Errors on 50 processors
[Figure: energy error, the difference between predictor and verifier, with the threshold for accepting the results]
17. Errors on 50 processors
[Figure: temperature error, the difference between predictor and verifier, with the threshold for accepting the results]
18. Speedup
[Figure: speedup expected based on progress rate, and speedup observed in simulations]
- Computation time for one time interval 10 s
- Prediction time $10^{-3}$ s
- Broadcast on 100 processors of IBM SP3 0.005 s
- Allreduce on 100 processors of IBM SP3 0.0005 s
The overhead/computation ratio is negligible, and speedup is determined only by errors
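A rough check of that claim, assuming the per-interval overhead is simply the sum of one prediction, one broadcast, and one all-reduce, reading the prediction time as $10^{-3}$ s, and using $r$ (exact time / parallel overhead) as defined on the guided-simulations slide:

```latex
\[
r = \frac{10}{10^{-3} + 0.005 + 0.0005} = \frac{10}{0.0065} \approx 1538,
\qquad
\frac{1}{1 + 1/r} \approx 0.99935
\]
```

Under these assumptions the overhead costs well under 0.1% of the ideal speedup, so the progress rate dominates.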
19. Limitations of the experiments
- They are simulations of a parallel implementation
- But the large difference between computation and communication times suggests that an efficient implementation is possible
20. Conclusions and future work
- Conclusions
- Promises significant improvement in speedup and efficiency for long-time simulations, through latency tolerance and fault tolerance
- Future work
- Implementation on a parallel machine
- Base simulations with a smaller time scale
- Better predictors
- Basis functions corresponding to physical phenomena likely to be experienced
- Use clustering techniques to determine phenomena experienced by different regions
- Automatically and dynamically determine a suitable base to use from a large set of existing results