Co-processors for speeding up drug design algorithms

Provided by: cseIit

1
Co-processors for speeding up drug design
algorithms
  • Advait Jain
  • Priyanka Jindal
  • Pulkit Gambhir
  • Under the guidance of
  • Prof. M Balakrishnan
  • Prof. Kolin Paul

2
Objective
  • To design FPGA-based hardware accelerators for
    speeding up the energy minimization process.

3
Approach to the problem
  • Familiarization with the code
  • Software profiling
  • Identifying bottleneck procedures/loops
  • Compiler level optimizations
  • H/w - S/w partitioning
  • Where to partition
  • APIs to export
  • Hardware Design
  • Performance Analysis

4
Overall Control Flow
5
Bottleneck Functions
6
Bottleneck Functions
7
Split Up code
                    Eval_Energy_for_step()    Diff_Energy()
Non-bonded pairs    68.61                     29.10
Dihedrals           00.54                     00.56
Angles              00.17                     00.12
Bonded              00.00                     00.00
8
Bottleneck Functions
Eval energy
Eval Energy for step
Diff energy
  • Iterate over list of bonds: O(N) elements
  • Iterate over list of angles: O(N) elements
  • Iterate over list of dihedrals: O(N) elements
  • Iterate over list of non-bonded pairs: O(N^2) elements
9
Molecule Size v/s Time (log plot)
Average Slope 2.03
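
A slope close to 2 on a log-log plot is what quadratic growth of the non-bonded pair list would produce. The sketch below illustrates this with a least-squares fit. It does not use the measured timings, which are not in the transcript: the molecule sizes are hypothetical, and the pair count n(n-1)/2 stands in for run time as an assumption.

```python
import math

def pair_count(n_atoms):
    """Number of unordered atom pairs: an upper bound on the non-bonded list length."""
    return n_atoms * (n_atoms - 1) // 2

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

sizes = [100, 200, 400, 800, 1600]      # hypothetical molecule sizes
work = [pair_count(n) for n in sizes]   # assumed proxy for measured run time
slope = loglog_slope(sizes, work)
print(f"log-log slope: {slope:.3f}")
```

With real timings in place of `work`, the same fit would give the kind of average slope quoted on the slide.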
10
Energy v/s CG Steps
We are here
11
Non-bonded List
Node structure: Float A, B, C (4 × 3 bytes); Int a1, a2
C is a function of the charges q1 and q2 of the atoms.
471,282 distinct Cs (an index fits in 3 bytes)
A and B are a function of the radius and epsilon of the
atoms. 192 distinct (A,B) pairs (an index fits in 1 byte)
12
New Data Structure
New Node structure: Int a1, a2; Unsigned common_index
a1 and a2 index the 3d coordinates of the atoms.
common_index: 3 bytes index the vector of distinct Cs,
1 byte indexes the vector of distinct (A,B) pairs.
13
Result of new data structure
  • Molecule size: 2008
  • VanderList: 2,008,417
  • AB_Vander list: 136
  • C_Vanderlist: 21,651

Improvement in cache performance
Old Data Structure          2,008,417 × 20                           = 40 MB
New Data Structure          2,008,417 × 12 + 136 × 8 + 21,651 × 4    = 24 MB
Projected Data Structure    2,008,417 × 8 + 136 × 8 + 21,651 × 4     = 16 MB
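
The packed index and the memory figures above can be reproduced with a short sketch. The bit layout is an assumption (C index in the upper 3 bytes, (A,B) index in the low byte); the slides only say that the two indices share one unsigned word. The function names are hypothetical.

```python
def pack_index(c_index, ab_index):
    """Pack a 3-byte C index and a 1-byte (A,B) index into one 32-bit word.

    Assumed layout: C index in the upper 24 bits, (A,B) index in the low 8 bits.
    """
    if not 0 <= c_index < (1 << 24):
        raise ValueError("C index does not fit in 3 bytes")
    if not 0 <= ab_index < (1 << 8):
        raise ValueError("(A,B) index does not fit in 1 byte")
    return (c_index << 8) | ab_index

def unpack_index(common_index):
    """Return (c_index, ab_index) from a packed 32-bit word."""
    return common_index >> 8, common_index & 0xFF

def footprint_bytes(n_pairs, node_bytes, n_ab=0, n_c=0):
    """List size in bytes: nodes, plus 8 bytes per (A,B) pair and 4 bytes per C."""
    return n_pairs * node_bytes + n_ab * 8 + n_c * 4

old = footprint_bytes(2_008_417, 20)               # A, B, C stored in every node
new = footprint_bytes(2_008_417, 12, 136, 21_651)  # a1, a2 + packed index
print(old, new)
```

Both totals round to the 40 MB and 24 MB quoted on the slide.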
14
Sorting to improve performance
  • Consecutive nodes of van-der list can point
    randomly anywhere in the C and (A,B) vectors
  • Scope for further improving Cache performance
  • Radix sort on the van-der list
  • First bucket sort on the C-index
  • Second stable bucket sort on the (A,B)-index
  • Sequential access of (A,B) vector
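
The two-pass sort described above can be sketched as a least-significant-key-first radix sort. Nodes are represented here as hypothetical `(a1, a2, c_index, ab_index)` tuples; the real node layout and list sizes are in the project code, not in this transcript.

```python
def stable_bucket_sort(nodes, key, n_buckets):
    """Stable bucket sort: nodes with equal keys keep their relative order."""
    buckets = [[] for _ in range(n_buckets)]
    for node in nodes:
        buckets[key(node)].append(node)
    return [node for bucket in buckets for node in bucket]

def radix_sort_pairs(nodes, n_c, n_ab):
    """Order nodes by (A,B) index, and by C index within each (A,B) index."""
    by_c = stable_bucket_sort(nodes, lambda nd: nd[2], n_c)     # first pass: C index
    return stable_bucket_sort(by_c, lambda nd: nd[3], n_ab)     # second pass: (A,B) index

# Hypothetical nodes: (a1, a2, c_index, ab_index)
nodes = [(0, 5, 2, 1), (1, 4, 0, 1), (2, 3, 1, 0), (0, 3, 0, 0)]
sorted_nodes = radix_sort_pairs(nodes, n_c=3, n_ab=2)
print(sorted_nodes)
```

Because the second pass is stable, the (A,B) vector is read sequentially while the C indices within each (A,B) group stay in ascending order.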

15
Cache Profiling (unsorted vs sorted)
Test case: molecule of 413 atoms with 25 SD and 100 CG steps

                    L1D refs         L1D misses              L2 refs
Unsorted  Total     1,773,145,080    44,016,787              44,754,341
          Rd        1,451,802,230    43,429,781 (3%)         44,167,335
          Wr          321,342,785       587,006 (0.1826%)       587,006
Sorted    Total     1,842,686,500    29,287,877              30,152,893
          Rd        1,495,124,238    28,470,590 (1.9%)       29,335,606
          Wr          347,562,262       817,287 (0.235%)        817,287
16
Converting to floating point
  • All the code is written in double precision
  • Double precision is difficult to replicate in hardware
  • Need to test the feasibility of conversion to single
    precision
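
One way to see why the conversion needs testing is to round values to single precision and compare an energy difference in both formats. This is a minimal sketch: the two energies are hypothetical values chosen for illustration, not numbers from the project, and `to_float32` is a helper introduced here.

```python
import struct

def to_float32(x):
    """Round a Python float (double) to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Hypothetical energies before and after a small step (not taken from the slides)
e_old = 12345.678901
e_new = 12345.678801

diff_double = e_old - e_new
diff_single = to_float32(to_float32(e_old) - to_float32(e_new))
print(diff_double, diff_single)
```

Near 12345 the spacing between adjacent single-precision values is 2^-10, about 0.001, so a difference of 0.0001 between two energies cannot be represented once both are stored as floats.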

17
Single Precision
minEnergyCG()
Precision lost here
diffEnergy()
evalEnergy_for_step()
Instability introduced here, resulting in NaN
moveStep()
18
Single Precision
  • Removed the instability
  • Parabolic interpolation is replaced by lnSearch()
    whenever the points are colinear.
  • Time taken to evaluate the energy increased.
  • The number of calls to evalEnergy_for_step()
    increased.
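
The colinearity guard can be sketched as follows. The vertex formula is the standard one for a parabola through three points; its denominator is zero when the points are colinear, which would otherwise produce a division by zero or NaN. The function names, the absolute tolerance `eps` and the `line_search` callback standing in for lnSearch() are assumptions for illustration, not the project's code.

```python
def parabolic_step(x0, f0, x1, f1, x2, f2, eps=1e-12):
    """Abscissa of the vertex of the parabola through three points.

    Returns None when the points are (nearly) colinear, i.e. when the
    denominator of the vertex formula is smaller than eps in magnitude.
    """
    num = (x1 - x0) ** 2 * (f1 - f2) - (x1 - x2) ** 2 * (f1 - f0)
    den = (x1 - x0) * (f1 - f2) - (x1 - x2) * (f1 - f0)
    if abs(den) < eps:
        return None
    return x1 - 0.5 * num / den

def choose_step(points, line_search):
    """Use parabolic interpolation when possible, else fall back to a line search."""
    step = parabolic_step(*points)
    return line_search() if step is None else step

# f(x) = (x - 2)^2 sampled at x = 0, 1, 3: the vertex is at x = 2
print(choose_step((0.0, 4.0, 1.0, 1.0, 3.0, 1.0), lambda: 0.25))
# Colinear points trigger the fallback
print(choose_step((0.0, 0.0, 1.0, 1.0, 2.0, 2.0), lambda: 0.25))
```

An absolute tolerance is the simplest choice; a single-precision implementation would probably need a threshold scaled to the magnitude of the energies.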

19
Slow Float Vs Double Time Plot
20
Control Flow
21
Single Precision (molecule size 2008, SD 100, CG 150)

No. of calls to evalEnergy_for_step()    Double    Slow Float
Total                                    642       893
From minEnergyCG()                       450       450
From lnSearch()                          192       443

                                         Double    Slow Float
No. of calls to lnSearch()               100       177
evalEnergy_for_step() per lnSearch()     1.92      2.5
22
Reducing the number of Calls
  • minEnergyCG
  • Parabolic interpolation: which 3 points to choose.
  • lnSearch
  • Iteratively calculates the step size.
  • When to stop the iteration is determined by 2
    tolerances.
  • What we did
  • Chose points for parabolic interpolation that are
    further apart.
  • Increased the tolerances until the time to
    minimize the energy was the same as for double.
  • Then profiled to check the actual energy.

23
Fast Float Vs Double Time Plot
24
Fast Float Vs Double Energy Plot
25
Our conclusions from this exercise
  • Located the source of instability.
  • However, converting to float increased the time
    required for the code to run.
  • Increasing the tolerances made the code fast again.
  • The energy computed in float did not agree well
    with the double computation.

26
Feedback from SCF-Bio team
  • They are interested primarily in relaxing the
    molecule.
  • The actual energy is not of any consequence.
  • To check the float code, the metric should be the
    error between the molecular structures (float vs
    double).

27
New Checking Methodology
Start Structure -> Double Relaxed Structure
Start Structure -> Float Relaxed Structure
RMS Distance between the two relaxed structures
Acceptance: RMS Distance < 0.5
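
A minimal sketch of this check, assuming both relaxed structures list the same atoms in the same order and share one coordinate frame (no superposition step is applied). The function names and sample coordinates are hypothetical; the 0.5 threshold is the one from the slide, whose unit is not stated.

```python
import math

def rms_distance(coords_a, coords_b):
    """Root-mean-square distance between matching atoms of two structures."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("structures must be non-empty and of equal size")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

def accept(double_structure, float_structure, threshold=0.5):
    """Accept the float result when its RMS distance to the double result is below threshold."""
    return rms_distance(double_structure, float_structure) < threshold

# Hypothetical relaxed structures with two atoms each
double_relaxed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
float_relaxed = [(0.0, 0.0, 0.3), (1.0, 0.4, 0.0)]
print(rms_distance(double_relaxed, float_relaxed), accept(double_relaxed, float_relaxed))
```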
28
RMS Distance vs CG Steps
We are here
29
Comparison with new metric
30
Tasks completed this semester
  • Software Profiling
  • No. of calls
  • Cache misses
  • Effect of parameters
  • Control Flow Analysis
  • Flow Diagram
  • Data parallelism
  • Floating point precision requirement
  • Exploring H/W Options
  • Platform Selection
  • S/W H/W Partitioning

31
Ongoing work next semester
  • Setting up building blocks
  • ZBT RAM access
  • PCI Interface
  • Floating Point Unit
  • Combining blocks for a simple implementation
  • Refining the implementation
  • Multiple compute engines
  • Multiple PCI cards