Optimizing N-body Simulations for Multi-core Compute Clusters

1
Optimizing N-body Simulations for Multi-core
Compute Clusters
  • Ammar Ahmad Awan
  • BIT-6

2
Presentation Outline
  • Introduction
  • Design & Implementation
  • Performance Evaluation
  • Conclusions and Future Work

3
Introduction
  • A sea change in basic computer architecture, driven by
  • Power consumption
  • Heat dissipation
  • Emergence of multiple energy-efficient processing
    cores instead of a single power-hungry core
  • Moore's law will now be realized by increasing
    core counts instead of increasing clock speeds
  • Impact on software applications
  • Change of focus from Instruction Level
    Parallelism (higher clock frequency) to Thread
    Level Parallelism (increasing core count)
  • Huge impact on the High Performance Computing (HPC)
    community
  • 70% of the TOP500 supercomputers are based on
    multi-core processors

4
Source: Google Images
5
Source: www.intel.com
6
SMP vs Multicore
(Figure: Symmetric Multi-Processor vs Multi-core Processor)
7
HPC and Multi-core
  • Message Passing Interface (MPI) is the de facto
    standard for programming today's supercomputers
  • Alternatives include OpenMP (for SMP machines)
    and Unified Parallel C (UPC)
  • With the existing approaches, it is possible to
    port MPI applications to multi-core processors
  • One MPI process per core, which we call the Pure
    MPI approach
  • OpenMP threads inside each MPI process, which we
    call the MPI+threads approach
  • We expect the MPI+threads approach to perform well
    because
  • Communication cost between threads is lower than
    between processes
  • Threads are light-weight
  • We have evaluated this hypothesis by comparing
    both approaches (a hybrid sketch follows below)
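
As an illustration, a minimal hybrid MPI+OpenMP sketch in C
(illustrative only; this is not the Gadget-2 source):

    /* One MPI process per node, OpenMP threads across its cores. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* Request an MPI level that tolerates threads inside a rank. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* Each thread handles a slice of this rank's particles. */
            printf("rank %d, thread %d of %d\n", rank,
                   omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }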

8
Pure MPI vs MPI+threads approach
9
Sample Application: N-body Simulations
  • To demonstrate the usefulness of our
    MPI+threads approach, we chose an N-body
    simulation code
  • The N-body or many-body method is used for
    simulating the evolution of a system consisting
    of n bodies
  • It has found widespread use in fields such as
  • Astrophysics
  • Molecular Dynamics
  • Computational Biology

10
Summation Approach to Solving N-body Problems
The most compute-intensive part of any N-body
method is the force calculation phase
  • The simplest expression for a far-field force
    f(i) on particle i is

      for i = 1 to n
          f(i) = sum over j = 1,...,n, j != i, of f(i,j)
      end for

  • where f(i,j) is the force on particle i due to
    particle j

The cost of this calculation is O(n²)
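
A direct-summation sketch in C (illustrative only; plain
gravitational interaction with a small softening term):

    #include <math.h>

    /* Naive O(n²) pairwise force (acceleration) calculation. */
    typedef struct { double pos[3], mass, acc[3]; } Particle;

    void compute_forces(Particle *p, int n)
    {
        const double G = 6.674e-11;   /* gravitational constant */
        const double eps2 = 1e-6;     /* softening, avoids divide-by-zero */
        for (int i = 0; i < n; i++) {
            p[i].acc[0] = p[i].acc[1] = p[i].acc[2] = 0.0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double dx = p[j].pos[0] - p[i].pos[0];
                double dy = p[j].pos[1] - p[i].pos[1];
                double dz = p[j].pos[2] - p[i].pos[2];
                double r2 = dx*dx + dy*dy + dz*dz + eps2;
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                /* a_i += G * m_j * r_ij / |r_ij|³ */
                p[i].acc[0] += G * p[j].mass * dx * inv_r3;
                p[i].acc[1] += G * p[j].mass * dy * inv_r3;
                p[i].acc[2] += G * p[j].mass * dz * inv_r3;
            }
        }
    }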
11
Barnes-Hut Tree
  • The Barnes-Hut algorithm is divided into three steps
  • Building the tree: O(n log n)
  • Computing cell centers of mass: O(n)
  • Computing forces: O(n log n)
    (see the opening-angle sketch after this list)
  • Other popular methods are
  • Fast Multipole Method
  • Particle Mesh Method
  • TreePM Method
  • Symplectic Methods
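
A sketch of the Barnes-Hut opening-angle test in C (illustrative
only; Cell, theta, and the leaf handling are simplified
assumptions, not Gadget-2's actual data structures):

    #include <math.h>

    /* An octree cell: center of mass, total mass, side length. */
    typedef struct Cell {
        double com[3], mass, size;
        struct Cell *child[8];      /* NULL where empty */
    } Cell;

    /* Accumulate the acceleration on a particle at pos. A cell of
       size s at distance r is treated as one pseudo-particle when
       s / r < theta; otherwise it is opened and we descend. */
    void add_force(const Cell *c, const double pos[3],
                   double acc[3], double theta)
    {
        if (c == NULL || c->mass == 0.0)
            return;
        double dx = c->com[0] - pos[0];
        double dy = c->com[1] - pos[1];
        double dz = c->com[2] - pos[2];
        double r = sqrt(dx*dx + dy*dy + dz*dz) + 1e-12;

        int is_leaf = 1;
        for (int k = 0; k < 8; k++)
            if (c->child[k]) is_leaf = 0;

        if (is_leaf || c->size / r < theta) {
            /* Far enough (or a leaf): use the center of mass. */
            double inv_r3 = 1.0 / (r * r * r);
            acc[0] += c->mass * dx * inv_r3;  /* G folded into units */
            acc[1] += c->mass * dy * inv_r3;
            acc[2] += c->mass * dz * inv_r3;
        } else {
            /* Too close: open the cell and recurse into children. */
            for (int k = 0; k < 8; k++)
                add_force(c->child[k], pos, acc, theta);
        }
    }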

12
Sample Application: Gadget-2
  • Cosmological simulation code
  • Simulates a system of n bodies
  • Implements the Barnes-Hut algorithm
  • Written in C, parallelized with MPI
  • As part of this project, we
  • Studied the Gadget-2 code and how it is used in
    production mode
  • Modified the C code to use threads in the
    Barnes-Hut tree algorithm
  • Added performance counters to the code for
    measuring cache utilization

13
Presentation Outline
  • Introduction
  • Design & Implementation
  • Performance Evaluation
  • Conclusions and Future Work

14
Gadget-2 Architecture
15
Code Analysis

Original Code:

    for ( i = 0 to No. of particles && n = 0 to BufferSize )
        calculate_force( i )
        for ( j = 0 to No. of tasks )
            export_particles( j )

Modified Code:

    parallel for ( i = 0 to n )
        calculate_force( i )
    for ( i = 0 to No. of particles && n = 0 to BufferSize )
        for ( j = 0 to No. of tasks )
            export_particles( j )
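
A hedged C/OpenMP sketch of the modified force loop (hypothetical
names; the actual Gadget-2 routines and buffering logic differ):

    /* Thread the per-particle force phase; keep the export phase
       serial because it fills a shared MPI communication buffer. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++)
        calculate_force(i);        /* independent per particle */

    for (int j = 0; j < ntasks; j++)
        export_particles(j);       /* one exchange per MPI task */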
16
Presentation Outline
  • Introduction
  • Design & Implementation
  • Performance Evaluation
  • Conclusions and Future Work

17
Evaluation Testbed
  • Our cluster, called Chenab, consists of nine nodes
  • Each node consists of an
  • Intel Xeon Quad-Core Kentsfield processor
  • 2.4 GHz with a 1066 MHz FSB
  • 4 MB L2 cache per pair of cores
  • 32 KB L1 cache per core
  • 2 GB main memory

18
Performance Evaluation
  • Performance evaluation is based on two main
    metrics
  • Execution Time
  • Measured directly from MPI wall-clock timings
    (MPI_Wtime)
  • Cache Utilization
  • We patched the Linux kernel using the perfctr patch
  • We selected the Performance API (PAPI) for hardware
    performance counting
  • Used PAPI_L2_TCM (total cache misses) and
    PAPI_L2_TCA (total cache accesses) to calculate
    the cache miss ratio (a PAPI sketch follows this list)
  • Results are shown on the upcoming slides
  • Execution Time for Colliding Galaxies
  • Execution Time for Cluster Formation
  • Execution Time for Custom Simulation
  • Cache Utilization for Cluster Formation
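
A minimal sketch of counting these events with PAPI's low-level
API (illustrative only; error checks omitted for brevity):

    #include <papi.h>
    #include <stdio.h>

    extern void compute_forces(void);   /* region being measured */

    void measure_l2_miss_ratio(void)
    {
        int events[2] = { PAPI_L2_TCM, PAPI_L2_TCA };
        long long values[2];
        int eventset = PAPI_NULL;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&eventset);
        PAPI_add_events(eventset, events, 2);

        PAPI_start(eventset);
        compute_forces();
        PAPI_stop(eventset, values);

        /* miss ratio = total L2 misses / total L2 accesses */
        printf("L2 miss ratio: %.3f\n",
               (double)values[0] / (double)values[1]);
    }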

19
Execution Time for Colliding Galaxies
20
Execution Time for Cluster Formation
21
Execution Time for Custom Simulation
22
Cache Utilization for Cluster Formation
Cache utilization was measured using hardware
counters exposed by the kernel patch (perfctr)
and the Performance API (PAPI)
23
Presentation Outline
  • Introduction
  • Design & Implementation
  • Performance Evaluation
  • Conclusions and Future Work

24
Conclusion
  • We optimized Gadget-2, our sample application
  • The MPI+threads approach performs better
  • The optimized code offers scalable performance
  • We are witnessing dramatic changes in core
    designs for multicore systems
  • Heterogeneous and homogeneous designs
  • Targeting a 1000-core processor will require
    scalable frameworks and tools for programming

25
Conclusion
  • Towards many-core computing
  • Multicore: 2x cores every 2 years → 64 cores in 8
    years (e.g., a quad-core doubling every two years:
    4 × 2⁴ = 64)
  • Manycore: 8x to 16x multicore

Source: Dave Patterson, Overview of the Parallel
Laboratory
26
Future Work
  • Scalable frameworks that provide programmer-friendly,
    high-level constructs are very important
  • PeakStream supports GPU and CPU+GPU hybrid
    programs
  • Cilk augments the C compiler with three new
    keywords (cilk_for, cilk_sync, cilk_spawn)
  • The Research Accelerator for Multiple Processors
    (RAMP) can be used to simulate a 1000-core processor
  • Gadget-2 can be ported to GPUs using Nvidia's
    CUDA framework
  • The xlc compiler can be used to program the STI
    Cell processor

28
The Timeline
32
Barnes-Hut Tree