An FPGA Implementation of SPME Reciprocal Sum Compute Engine



Slides: 38

Provided by: Sam358


Transcript and Presenter's Notes


1
An FPGA Implementation of the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)
Sam Lee
2
What is this Thesis about?

Implementation
Reciprocal Sum Compute Engine (RSCE).
FPGA-based.
Accelerates part of the Molecular Dynamics simulation.
Uses the Smooth Particle Mesh Ewald (SPME) algorithm.
Investigation
Precision requirement.
Speedup capability.
Parallelization strategy.

3
Outline

What is Molecular Dynamics Simulation?
What calculations are involved?
How do we accelerate and parallelize the calculations?
What did we find out about precision?
What did we find out about speedup?
What is left to be done?

4
Molecular Dynamics Simulation
5
Molecular Dynamics Simulation

$\mathbf{E} = -\nabla V$ (electric field = negative gradient of the potential)
$\mathbf{F} = Q\mathbf{E}$ (force = charge × electric field)
$\mathbf{F} = m\mathbf{a}$ (force = mass × acceleration)
Time integration => new positions and velocities
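The chain on this slide (field, force, acceleration, time integration) can be sketched as one MD timestep. This is a minimal NumPy sketch, not the thesis's software: the slides do not name an integrator, so velocity Verlet is assumed here, and `velocity_verlet_step` and the `efield` callback are hypothetical names.

```python
import numpy as np

def velocity_verlet_step(pos, vel, charge, mass, efield, dt):
    """Advance all particles by one timestep (hypothetical helper).

    efield(pos) must return the electric field at each particle, shape (N, 3).
    """
    # F = QE and F = ma give a = QE / m
    acc = charge[:, None] * efield(pos) / mass[:, None]
    # New positions from the current velocity and acceleration
    new_pos = pos + vel * dt + 0.5 * acc * dt**2
    # Re-evaluate the field at the new positions, then average the two accelerations
    new_acc = charge[:, None] * efield(new_pos) / mass[:, None]
    new_vel = vel + 0.5 * (acc + new_acc) * dt
    return new_pos, new_vel

# Usage: one particle in a uniform field along x (constant acceleration Q*E/m = 4)
def uniform_field(p):
    return np.tile([3.0, 0.0, 0.0], (len(p), 1))

pos, vel = np.zeros((1, 3)), np.zeros((1, 3))
charge, mass = np.array([2.0]), np.array([1.5])
for _ in range(10):
    pos, vel = velocity_verlet_step(pos, vel, charge, mass, uniform_field, 0.1)
```

In a production MD code the field evaluation is the expensive part, so the acceleration computed at the end of one step is normally reused at the start of the next rather than recomputed as in this sketch.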
6
MD Simulation

Problem scientists are facing:
SLOW!
$O(N^2)$.
$N = 10^5$, time span = 1 ns, timestep size = 1 fs
=> $10^{22}$ calculations.
A 3 GHz computer takes $5.8 \times 10^{12}$ days to finish!!

7
Solution

Accelerate with an FPGA,
especially the $O(N^2)$ calculations.
More specifically, the thesis addresses the reciprocal electrostatic energy and force calculations
of the Smooth Particle Mesh Ewald algorithm.

8
Previous Work

Software Implementations
Original PME Package written by Toukmaji.
NAMD2.
AMBER.
Hardware Implementations
No previous hardware implementations of SPME.
MD-Grape and MD-Engine used Ewald Summation.
Ewald Summation is $O(N^2)$; SPME is $O(N \log N)$!

9
Calculations Involved

Smooth Particle Mesh Ewald

10
Electrostatic Interaction

Coulombic equation
Under the periodic boundary condition, the summation is only conditionally convergent.
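To see where the $O(N^2)$ cost comes from, here is a plain pairwise Coulomb energy for a single, non-periodic box. It is a sketch under assumptions: units with the Coulomb constant set to 1, no periodic images, and `coulomb_energy_direct` is a hypothetical name. Under periodic boundary conditions the same sum would also have to run over all periodic images, and it is that lattice sum that is only conditionally convergent.

```python
import numpy as np

def coulomb_energy_direct(pos, q):
    """Sum q_i * q_j / r_ij over all pairs i < j (no periodic images, k_e = 1)."""
    energy = 0.0
    n = len(q)
    for i in range(n):              # N(N-1)/2 pair evaluations: O(N^2)
        for j in range(i + 1, n):
            r_ij = np.linalg.norm(pos[i] - pos[j])
            energy += q[i] * q[j] / r_ij
    return energy

# Two opposite unit charges 2 length units apart
pos = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(coulomb_energy_direct(pos, np.array([1.0, -1.0])))  # -0.5
```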

11
Periodic Boundary Condition

To combat the surface effect

Replication
12
Ewald Summation Used For PBC

Used to calculate the Coulombic interactions.
$O(N^2)$ direct sum + $O(N^2)$ reciprocal sum.

[Figure: direct sum and reciprocal sum; axis: r]
13
Smooth Particle Mesh Ewald

Shift the workload to the reciprocal sum.
Use the Fast Fourier Transform (FFT).
$O(N)$ real sum + $O(N \log N)$ reciprocal sum.
The RSCE calculates the reciprocal sum using the SPME algorithm.

14
SPME Reciprocal Energy
[Figure: SPME reciprocal energy calculation, including FFT steps]
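The reciprocal energy on this slide can be sketched in software. The NumPy code below follows the standard SPME formulation of Essmann et al. (1995): spread the charges onto a $K \times K \times K$ mesh with order-$n$ cardinal B-splines, take a 3D FFT, and weight $|F(Q)|^2$ by the B-spline moduli and the Ewald Gaussian factor. It is a floating-point illustration, not the RSCE's fixed-point datapath; the cubic box of side `L`, units with the Coulomb constant set to 1, the defaults `K=32`, `n=6`, `beta=0.3`, and all function names are assumptions. A brute-force Ewald reciprocal sum is included as a cross-check.

```python
import numpy as np
from itertools import product

def bspline(u, n):
    """Cardinal B-spline M_n(u), nonzero only for 0 < u < n."""
    if n == 2:
        return max(0.0, 1.0 - abs(u - 1.0))
    return (u * bspline(u, n - 1) + (n - u) * bspline(u - 1.0, n - 1)) / (n - 1)

def spread_charges(pos, q, L, K, n):
    """Build the charge mesh Q by B-spline interpolation with periodic wrap."""
    Q = np.zeros((K, K, K))
    for r, qi in zip(pos, q):
        u = K * ((r / L) % 1.0)              # scaled fractional coordinates in [0, K)
        base = np.floor(u).astype(int)
        frac = u - base
        w = [[bspline(frac[d] + j, n) for j in range(n)] for d in range(3)]
        for j1, j2, j3 in product(range(n), repeat=3):
            Q[(base[0] - j1) % K, (base[1] - j2) % K, (base[2] - j3) % K] += (
                qi * w[0][j1] * w[1][j2] * w[2][j3])
    return Q

def bspline_moduli(K, n):
    """|b(m)|^2 for m = 0..K-1 (FFT index order)."""
    m = np.arange(K)
    denom = sum(bspline(k + 1.0, n) * np.exp(2j * np.pi * m * k / K) for k in range(n - 1))
    return 1.0 / np.abs(denom) ** 2

def spme_reciprocal_energy(pos, q, L, K=32, n=6, beta=0.3):
    FQ = np.fft.fftn(spread_charges(pos, q, L, K, n))
    b2 = bspline_moduli(K, n)
    B = b2[:, None, None] * b2[None, :, None] * b2[None, None, :]
    mi = np.fft.fftfreq(K, d=1.0 / K)        # integer wave indices in FFT order
    m1, m2, m3 = np.meshgrid(mi, mi, mi, indexing="ij")
    msq = (m1**2 + m2**2 + m3**2) / L**2
    msq[0, 0, 0] = 1.0                       # placeholder; the m = 0 term is dropped below
    C = np.exp(-np.pi**2 * msq / beta**2) / msq
    C[0, 0, 0] = 0.0
    return np.sum(C * B * np.abs(FQ) ** 2) / (2 * np.pi * L**3)

def ewald_reciprocal_energy(pos, q, L, beta=0.3, kmax=8):
    """Brute-force Ewald reciprocal sum over |m_i| <= kmax, for comparison."""
    total = 0.0
    for k in product(range(-kmax, kmax + 1), repeat=3):
        if k == (0, 0, 0):
            continue
        m = np.array(k) / L
        msq = m @ m
        S = np.sum(q * np.exp(2j * np.pi * (pos @ m)))   # structure factor
        total += np.exp(-np.pi**2 * msq / beta**2) / msq * abs(S) ** 2
    return total / (2 * np.pi * L**3)

rng = np.random.default_rng(1)
L = 10.0
pos = rng.uniform(0.0, L, size=(8, 3))
q = np.array([1.0, -1.0] * 4)
e_spme = spme_reciprocal_energy(pos, q, L)
e_ewald = ewald_reciprocal_energy(pos, q, L)
print(e_spme, e_ewald, abs(e_spme - e_ewald) / e_ewald)
```

The FFT turns the $O(N^2)$ structure-factor loop into an $O(K^3 \log K^3)$ mesh transform, which is the workload shift this SPME slide set describes.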
15
SPME Reciprocal Force
16
Reciprocal Sum Compute Engine (RSCE)
17
RSCE Validation Environment
18
RSCE Architecture
19
RSCE Verification Testbench
20
RSCE SystemC Model
21
MD Simulations with the RSCE
22
RSCE Precision Goal

Goal: relative error $< 10^{-5}$.
Two major calculation steps
B-Spline Calculation.
3D-FFT Calculation.
Due to limited logic resources, a limited-precision FFT LogiCore is used.
=> The precision goal CANNOT be achieved.
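To get a feel for why a limited-precision FFT core caps accuracy, the sketch below emulates a radix-2 FFT in which every twiddle factor and every butterfly product is rounded to a fixed-point grid with `frac_bits` fractional bits, then measures the relative error against NumPy's double-precision FFT. This is a crude software model under stated assumptions: it does not reproduce the Xilinx FFT LogiCore's actual word lengths, scaling, or overflow behaviour, and all names are hypothetical.

```python
import numpy as np

def quantize(x, frac_bits):
    """Round real and imaginary parts to a fixed-point grid with frac_bits fractional bits."""
    scale = 2.0 ** frac_bits
    return (np.round(x.real * scale) + 1j * np.round(x.imag * scale)) / scale

def fixed_point_fft(x, frac_bits):
    """Recursive radix-2 decimation-in-time FFT with rounding (len(x) must be a power of 2)."""
    n = len(x)
    if n == 1:
        return quantize(np.asarray(x, dtype=complex), frac_bits)
    even = fixed_point_fft(x[0::2], frac_bits)
    odd = fixed_point_fft(x[1::2], frac_bits)
    twiddle = quantize(np.exp(-2j * np.pi * np.arange(n // 2) / n), frac_bits)
    t = quantize(twiddle * odd, frac_bits)     # rounded butterfly product
    return np.concatenate([even + t, even - t])

def fft_relative_error(frac_bits, n=64, seed=0):
    """Relative L2 error of the rounded FFT on random input in [-1, 1]."""
    x = np.random.default_rng(seed).uniform(-1.0, 1.0, n)
    ref = np.fft.fft(x)
    return np.linalg.norm(fixed_point_fft(x, frac_bits) - ref) / np.linalg.norm(ref)

for bits in (12, 16, 20, 24):
    print(bits, fft_relative_error(bits))
```

The error falls roughly in proportion to $2^{-\text{frac\_bits}}$, so the word length an FFT core can afford within the available logic directly bounds the relative error that downstream energy and force calculations can reach.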

23
MD Simulation with RSCE

RMS Energy Error Fluctuation

24
FFT Precision Vs. Energy Fluctuation
25
Speedup Analysis

RSCE vs. Software Implementation

26
RSCE Speedup

RSCE @ 100 MHz vs. Intel P4 @ 2.4 GHz.
Speedup 3x to 14x
RSCE Computation time

27
RSCE Speedup

Why so insignificant?
QMM bandwidth limitation.
Sequential nature of the SPME algorithm.
Solution
Use more QMM memories.
Slight design modifications required.

28
Multi-QMM RSCE Speedup

NQ-QMM RSCE Computation time

The 4-QMM RSCE
Speedup 14x to 20x.
Assume N is of the same order as KxKxK
Speedup 3(NQ-1)x

29
RSCE Speedup
N      P  K    Single-QMM Speedup     Four-QMM Speedup         Four-QMM Speedup
                against Software       against Single-QMM       against Software
20000  4  32    5.44x                  3.37x                    18x
20000  4  64    6.97x                  2.10x                    14x
20000  4  128   10.70x                 1.46x                    15x
20000  8  32    3.72x                  3.90x                    14x
20000  8  64    5.17x                  3.37x                    17x
20000  8  128   7.94x                  2.10x                    16x
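Reading the speedup table, the last column appears to be the product of the two speedup columns before it, truncated to a whole number. A quick arithmetic check (values copied from the table; the script is only an illustration):

```python
# (N, P, K, single-QMM vs. software, four-QMM vs. single-QMM, four-QMM vs. software)
rows = [
    (20000, 4, 32, 5.44, 3.37, 18),
    (20000, 4, 64, 6.97, 2.10, 14),
    (20000, 4, 128, 10.70, 1.46, 15),
    (20000, 8, 32, 3.72, 3.90, 14),
    (20000, 8, 64, 5.17, 3.37, 17),
    (20000, 8, 128, 7.94, 2.10, 16),
]

for n, p, k, single_vs_sw, four_vs_single, four_vs_sw in rows:
    combined = single_vs_sw * four_vs_single   # speedups compose multiplicatively
    print(f"N={n} P={p} K={k}: {single_vs_sw} x {four_vs_single} = {combined:.2f} "
          f"(table: {four_vs_sw}x)")
    assert int(combined) == four_vs_sw
```

The same rows show the extra QMMs helping most when $N$ is of the same order as $K \times K \times K$ (K = 32) and least when the mesh is much larger than $N$ (K = 128).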
30
Parallelization Strategy

When Multiple RSCEs are Used Together

31
RSCE Parallelization Strategy

Assume a 2-D simulation.
Assume $P = 2$, $K = 8$, $N = 6$.
Assume $\text{NumP} = 4$.

[Figure: an 8x8x8 mesh and four 4x4x4 mini meshes]
32
RSCE Parallelization Strategy

Mini-mesh composed -> 2D-IFFT.
2D-IFFT = two passes of 1D-FFT (X and Y).

[Figure: X-direction FFT and Y-direction FFT]
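The two-pass decomposition can be emulated in software. In the sketch below, each of `num_p` hypothetical engines owns a slab of rows for the X-direction pass, the partially transformed data are regrouped so each engine owns a slab of columns, and the Y-direction pass completes the transform. Slab decomposition (rather than the mini-mesh blocks in the figure) and the all-to-all exchange are assumptions made for brevity; the result is checked against NumPy's `fft2`/`ifft2`.

```python
import numpy as np

def distributed_fft2(mesh, num_p, inverse=False):
    """2D (I)FFT computed as an X pass and a Y pass split across num_p engines."""
    transform = np.fft.ifft if inverse else np.fft.fft
    K = mesh.shape[0]
    assert mesh.shape == (K, K) and K % num_p == 0
    # X-direction pass: each engine transforms the rows it owns (axis 1 = X)
    row_slabs = [transform(s, axis=1) for s in np.split(mesh, num_p, axis=0)]
    # Exchange: regroup the partially transformed mesh into column slabs
    col_slabs = np.split(np.concatenate(row_slabs, axis=0), num_p, axis=1)
    # Y-direction pass: each engine transforms the columns it owns (axis 0 = Y)
    return np.concatenate([transform(s, axis=0) for s in col_slabs], axis=1)

mesh = np.random.default_rng(0).standard_normal((8, 8))
print(np.allclose(distributed_fft2(mesh, 4), np.fft.fft2(mesh)))                 # True
print(np.allclose(distributed_fft2(mesh, 4, inverse=True), np.fft.ifft2(mesh)))  # True
```

The exchange step between the two passes is where multiple engines must communicate, which is why the 1D passes, not the energy or force arithmetic, shape the parallelization strategy.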
33
Parallelization Strategy

2D-IFFT -> energy calculation -> 2D-FFT.
2D-FFT -> force calculation.

[Figure: energy calculation, 2D-FFT, and force calculation]
34
Multi-RSCE System
35
Conclusion

Successful integration of the RSCE into NAMD2.
Single-QMM RSCE Speedup 3x to 14x.
NQ-QMM RSCE Speedup 14x to 20x.
When $N \approx K \times K \times K$, NQ-QMM speedup is 3(NQ-1)x.
A Multi-RSCE system is still a better alternative than the Multi-FPGA Ewald Summation system.

36
Future Work

Input Precision Analysis.
More in-depth FFT Precision Analysis.
Implementation of a block floating-point FFT.
More investigation of how different simulation settings (K, P, and N) affect the RSCE speedup.
Investigate how to better parallelize the SPME algorithm.

37
Questions?
