Title: An FPGA Implementation of SPME Reciprocal Sum Compute Engine
1An FPGA Implementation of the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)
Sam Lee
2What is this Thesis about?
- Implementation
- Reciprocal Sum Compute Engine (RSCE).
- FPGA based.
- Accelerate part of Molecular Dynamics Sim.
- Smooth Particle Mesh Ewald.
- Investigation
- Precision requirement.
- Speedup capability.
- Parallelization strategy.
3Outline
- What is Molecular Dynamics Simulation?
- What calculations are involved?
- How do we accelerate and parallelize the calculations?
- What did we find out about precision?
- What did we find out about speedup?
- What is left to be done?
4Molecular Dynamics Simulation
5Molecular Dynamics Simulation
- E = -∇V (Electric Field = -Gradient of Potential)
- F = QE (Force = Charge x Electric Field)
- F = ma (Force = Mass x Acceleration)
- Time integration => New Positions and Velocities
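The loop on this slide (potential to field, field to force, force to acceleration, then time integration) can be sketched for a single charge in one dimension. This is a minimal illustration, not the thesis code: the harmonic potential, the unit charge and mass, the finite-difference gradient, and the choice of velocity Verlet as the integrator are all assumptions made for the example.

```python
def electric_field(potential, x, h=1e-6):
    """E = -dV/dx, estimated with a central difference (illustrative only)."""
    return -(potential(x + h) - potential(x - h)) / (2.0 * h)


def md_step(x, v, q, m, potential, dt):
    """One velocity-Verlet timestep for a single charge in 1-D."""
    a = q * electric_field(potential, x) / m          # F = QE, then a = F/m
    x_new = x + v * dt + 0.5 * a * dt * dt            # new position
    a_new = q * electric_field(potential, x_new) / m  # force at the new position
    v_new = v + 0.5 * (a + a_new) * dt                # new velocity
    return x_new, v_new


def total_energy(x, v, q, m, potential):
    """Kinetic plus potential energy, used to check the integrator."""
    return 0.5 * m * v * v + q * potential(x)


def harmonic(x):
    """Hypothetical test potential V(x) = x^2 / 2."""
    return 0.5 * x * x


x, v = 1.0, 0.0
e0 = total_energy(x, v, 1.0, 1.0, harmonic)
for _ in range(1000):
    x, v = md_step(x, v, 1.0, 1.0, harmonic, 0.01)
e1 = total_energy(x, v, 1.0, 1.0, harmonic)
```

Comparing `e0` and `e1` is a quick sanity check: a reasonable integrator should keep the total energy nearly constant over the run.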
6MD Simulation
- Problem scientists are facing
- SLOW!
- O(N^2).
- N = 10^5, time-span = 1 ns, timestep size = 1 fs => 10^22 calculations.
- A 3 GHz computer takes 5.8 x 10^12 days to finish!!
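Where the O(N^2) cost comes from can be seen in a direct pairwise Coulomb sum: every particle interacts with every other particle, so the inner loop runs N(N-1)/2 times per timestep. The sketch below is only an illustration. It omits the physical prefactor and periodic images, the four test particles are made up, and the slide's own operation count and run-time estimate are not reproduced here.

```python
import math


def coulomb_energy(positions, charges):
    """Direct sum of q_i * q_j / r_ij over all unique pairs.

    Returns the energy (no 1/(4*pi*eps0) prefactor) and the number of
    pair evaluations, which grows as N*(N-1)/2.
    """
    energy = 0.0
    pairs = 0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            energy += charges[i] * charges[j] / r
            pairs += 1
    return energy, pairs


# Hypothetical four-particle system, chosen only to exercise the loop.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 4.0)]
charges = [1.0, -1.0, 1.0, -1.0]
energy, pairs = coulomb_energy(positions, charges)
```

Doubling the number of particles roughly quadruples `pairs`, which is why the slides single out the O(N^2) part for acceleration.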
7Solution
- Accelerate with FPGA
- Especially
- The O(N^2) calculations.
- To be more specific, the thesis addresses
- Reciprocal Electrostatic energy and force calculations.
- Smooth Particle Mesh Ewald algorithm.
8Previous Work
- Software Implementations
- Original PME Package written by Toukmaji.
- NAMD2.
- AMBER.
- Hardware Implementations
- No previous hardware implementations of SPME.
- MD-Grape and MD-Engine used Ewald Summation.
- Ewald Summation is O(N^2); SPME is O(N log N)!
9Calculations Involved
- Smooth Particle Mesh Ewald
10Electrostatic Interaction
- Coulombic equation
- Under the Periodic Boundary Condition, the summation is only conditionally convergent.
11Periodic Boundary Condition
Replication
12Ewald Summation Used For PBC
- To calculate for the Coulombic Interactions.
- O(N^2) Direct Sum + O(N^2) Reciprocal Sum.
[Figure labels: Direct Sum, Reciprocal Sum, r]
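The split into a direct sum and a reciprocal sum rests on a standard identity: 1/r = erfc(beta*r)/r + erf(beta*r)/r. The first term decays quickly and is summed in real space; the second is smooth and is summed in reciprocal space. The sketch below only demonstrates that identity. The name `beta` for the splitting parameter and its value are assumptions for illustration; the slides do not give the exact formulation used in the thesis.

```python
import math


def ewald_split(r, beta):
    """Split the Coulomb kernel 1/r into short- and long-range parts."""
    short_range = math.erfc(beta * r) / r  # decays fast: direct (real-space) sum
    long_range = math.erf(beta * r) / r    # smooth: reciprocal (Fourier-space) sum
    return short_range, long_range


# Hypothetical splitting parameter and sample distances.
beta = 1.0
samples = {r: ewald_split(r, beta) for r in (0.5, 1.0, 2.0, 4.0)}
```

A larger `beta` makes the short-range term vanish at smaller distances, which moves work from the direct sum to the reciprocal sum.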
13Smooth Particle Mesh Ewald
- Shift the workload to the Reciprocal Sum.
- Use Fast Fourier Transform.
- O(N) Real + O(N log N) Reciprocal.
- RSCE calculates the Reciprocal Sum using the SPME
algorithm.
14SPME Reciprocal Energy
[Figure labels: FFT, FFT]
15SPME Reciprocal Force
16Reciprocal Sum Compute Engine(RSCE)
17RSCE Validation Environment
18RSCE Architecture
19RSCE Verification Testbench
20RSCE SystemC Model
21MD Simulations with the RSCE
22RSCE Precision Goal
- Goal: relative error < 10^-5.
- Two major calculation steps
- B-Spline Calculation.
- 3D-FFT Calculation.
- Due to limited logic resources and the limited precision of the FFT LogiCore:
- => Precision goal CANNOT be achieved.
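To make the B-spline step and the precision question concrete, the sketch below computes cardinal B-spline weights with the usual recursion and then rounds them to a fixed number of fractional bits. It is an illustration under assumptions: the `order` argument is taken to play the role of P on the slides, the fixed-point format (16 fractional bits) is hypothetical, and nothing here reflects the RSCE's actual datapath widths.

```python
def bspline(order, u):
    """Cardinal B-spline M_order(u), nonzero for 0 < u < order (order >= 2)."""
    if order == 2:
        return max(0.0, 1.0 - abs(u - 1.0))
    lower = bspline(order - 1, u)
    upper = bspline(order - 1, u - 1.0)
    return (u * lower + (order - u) * upper) / (order - 1)


def spreading_weights(order, frac):
    """Weights for the `order` mesh points near a particle.

    `frac` is the particle's fractional offset in [0, 1) along one axis.
    The weights sum to 1, so total charge is preserved.
    """
    return [bspline(order, frac + k) for k in range(order)]


def quantize(value, frac_bits):
    """Round to a fixed-point value with `frac_bits` fractional bits (assumed format)."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


weights = spreading_weights(4, 0.5)
worst_error = max(abs(w - quantize(w, 16)) for w in weights)
```

Each quantized weight is off by at most half of the least significant bit, so the number of fractional bits sets a floor on the absolute error. A relative-error goal is harder to meet for the small weights.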
23MD Simulation with RSCE
- RMS Energy Error Fluctuation
24FFT Precision Vs. Energy Fluctuation
25Speedup Analysis
- RSCE vs. Software Implementation
26RSCE Speedup
- RSCE @ 100 MHz vs. Intel P4 @ 2.4 GHz.
- Speedup: 3x to 14x
- RSCE Computation time
27RSCE Speedup
- Why so insignificant?
- QMM bandwidth limitation.
- Sequential nature of the SPME algorithm.
- Solution
- Use more QMM memories.
- Slight design modifications required.
28Multi-QMM RSCE Speedup
- NQ-QMM RSCE Computation time
- The 4-QMM RSCE
- Speedup: 14x to 20x.
- Assume N is of the same order as KxKxK:
- Speedup: 3(NQ-1)x
29RSCE Speedup
N      P  K    Single-QMM Speedup  Four-QMM Speedup    Four-QMM Speedup
               against Software    against Single-QMM  against Software
20000  4  32   5.44x               3.37                18x
20000  4  64   6.97x               2.10                14x
20000  4  128  10.70x              1.46                15x
20000  8  32   3.72x               3.90                14x
20000  8  64   5.17x               3.37                17x
20000  8  128  7.94x               2.10                16x
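The three speedup columns can be cross-checked against each other. In the figures above, the four-QMM speedup against software appears to be the product of the other two columns with the fraction dropped. That reading is inferred from the numbers, not stated on the slide. A short check using the table values:

```python
# (P, K, single-QMM vs software, four-QMM vs single-QMM, four-QMM vs software)
rows = [
    (4, 32, 5.44, 3.37, 18),
    (4, 64, 6.97, 2.10, 14),
    (4, 128, 10.70, 1.46, 15),
    (8, 32, 3.72, 3.90, 14),
    (8, 64, 5.17, 3.37, 17),
    (8, 128, 7.94, 2.10, 16),
]

# Multiply the two partial speedups and truncate to a whole number.
products = [int(single * ratio) for _, _, single, ratio, _ in rows]
reported = [total for _, _, _, _, total in rows]
consistent = products == reported
```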
30Parallelization Strategy
- When Multiple RSCEs are Used Together
31RSCE Parallelization Strategy
- Assume a 2-D Simulation.
- Assume P = 2, K = 8, N = 6.
- Assume NumP = 4.
Four 4x4x4 Mini Meshes
An 8x8x8 mesh
32RSCE Parallelization Strategy
- Mini-mesh composed -> 2D-IFFT
- 2D-IFFT = two passes of 1D-FFT (X and Y).
Y Direction FFT
X Direction FFT
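That a 2D inverse transform can be done as two passes of 1D transforms, one per direction, can be checked numerically. The sketch below uses a naive O(n^2) inverse DFT as a stand-in for a 1D FFT core and compares the two-pass result with a direct 2D inverse DFT. The function names and the 4x4 test mesh are made up for the example, and the pass order shown is one of two equivalent choices.

```python
import cmath


def idft_1d(values):
    """Naive inverse DFT of one row or column (stand-in for a 1D FFT core)."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def idft_2d_two_pass(mesh):
    """2D inverse DFT as a pass over rows followed by a pass over columns."""
    rows_done = [idft_1d(row) for row in mesh]                   # X-direction pass
    cols_done = [idft_1d(list(col)) for col in zip(*rows_done)]  # Y-direction pass
    return [list(row) for row in zip(*cols_done)]                # transpose back


def idft_2d_direct(mesh):
    """Reference 2D inverse DFT evaluated straight from the definition."""
    ny, nx = len(mesh), len(mesh[0])
    return [
        [
            sum(
                mesh[ky][kx] * cmath.exp(2j * cmath.pi * (kx * x / nx + ky * y / ny))
                for ky in range(ny)
                for kx in range(nx)
            ) / (nx * ny)
            for x in range(nx)
        ]
        for y in range(ny)
    ]


# Hypothetical 4x4 complex mesh used only for the comparison.
mesh = [[complex(x + 2 * y, x * y) for x in range(4)] for y in range(4)]
two_pass = idft_2d_two_pass(mesh)
direct = idft_2d_direct(mesh)
max_diff = max(abs(a - b) for ra, rb in zip(two_pass, direct) for a, b in zip(ra, rb))
```

Within each pass the 1D transforms are independent of one another, so the rows or columns can in principle be distributed across several engines.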
33Parallelization Strategy
- 2D-IFFT -> Energy Calculation -> 2D-FFT
- 2D-FFT -> Force Calculation
Energy Calculation
Force Calculation
2D-FFT
34Multi-RSCE System
35Conclusion
- Successful integration of the RSCE into NAMD2.
- Single-QMM RSCE Speedup 3x to 14x.
- NQ-QMM RSCE Speedup 14x to 20x.
- When N is of the same order as KxKxK, NQ-QMM Speedup: 3(NQ-1)x.
- Multi-RSCE system is still a better alternative
than the Multi-FPGA Ewald Summation system.
36Future Work
- Input Precision Analysis.
- More in-depth FFT Precision Analysis.
- Implementation of block floating-point FFT.
- More investigation on how different simulation settings (K, P, and N) affect the RSCE speedup.
- Investigate how to better parallelize the SPME algorithm.
37Questions?