Title: Optimizing N-body Simulations for Multi-core Compute Clusters
1. Optimizing N-body Simulations for Multi-core Compute Clusters
2. Presentation Outline
- Introduction
- Design and Implementation
- Performance Evaluation
- Conclusions and Future Work
3. Introduction
- Sea change in basic computer architecture
  - Power Consumption
  - Heat Dissipation
- Emergence of multiple energy-efficient processing cores instead of a single power-hungry core
- Moore's law will now be realized by increasing core counts instead of increasing clock speeds
- Impact on software applications
  - Change of focus from Instruction Level Parallelism (higher clock frequency) to Thread Level Parallelism (increasing core count)
- Huge impact on the High Performance Computing (HPC) community
  - 70% of the TOP500 supercomputers are based on multi-core processors
4. Source: Google Images
5. Source: www.intel.com
6. SMP vs. Multi-core
- Symmetric Multi-Processor
- Multi-core Processor
7. HPC and Multi-core
- Message Passing Interface (MPI) is the de facto standard for programming today's supercomputers
- Alternatives include OpenMP (for SMP machines) and Unified Parallel C (UPC)
- With the existing approaches, it is possible to port MPI applications to multi-core processors
- One MPI process per core: we call this the Pure MPI approach
- OpenMP threads inside each MPI process: we call this the MPI+threads approach
- We expect the MPI+threads approach to perform better because
  - Communication between threads is cheaper than communication between processes
  - Threads are light-weight
- We have evaluated this hypothesis by comparing both approaches
8. Pure MPI vs. MPI+threads Approach
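For reference, a minimal C sketch of how the MPI+threads configuration is typically structured (illustrative only: the thread-support level, what work goes inside the parallel region, and the launch settings are assumptions, not the exact configuration evaluated here). Pure MPI would run the same program with one rank per core and a single thread per rank.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    /* MPI+threads skeleton: one MPI rank per node (or per socket), with
       OpenMP threads sharing the compute work on that node's cores. */
    int main(int argc, char *argv[])
    {
        int provided, rank;

        /* FUNNELED: only the main thread of each rank makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* Compute-only work (e.g. per-particle force calculation)
               is divided among the threads of this rank. */
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        /* Communication (particle exchange, reductions) happens here,
           outside the threaded region, on the main thread. */
        MPI_Finalize();
        return 0;
    }

Launching with one rank per node and OMP_NUM_THREADS set to the number of cores per node gives the MPI+threads configuration; launching with one rank per core and OMP_NUM_THREADS=1 reduces it to Pure MPI.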
9. Sample Application: N-body Simulations
- To demonstrate the usefulness of our MPI+threads approach, we chose an N-body simulation code
- The N-body (many-body) method is used for simulating the evolution of a system consisting of n bodies
- It has found widespread use in the fields of
- Astrophysics
- Molecular Dynamics
- Computational Biology
10. Summation Approach to solving N-body problems
The most compute-intensive part of any N-body method is the force calculation phase
- The simplest expression for a far-field force f(i) on particle i is

    for i = 1 to n
        f(i) = Σ_{j = 1,...,n, j ≠ i} f(i,j)
    end for

- where f(i,j) is the force on particle i due to particle j
The cost of this calculation is O(n²)
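As an illustration, a minimal C sketch of this direct-summation loop for a gravitational far-field force (the struct, constants, and function name are placeholders for illustration, not taken from any particular code):

    #include <math.h>

    #define N          1024
    #define G          6.674e-11   /* gravitational constant             */
    #define SOFTENING  1e-9        /* keeps r from reaching exactly zero */

    typedef struct { double x, y, z, mass; } Particle;

    /* Direct O(n^2) summation: for every particle i, accumulate the
       force contributed by every other particle j. */
    void compute_forces(const Particle p[N],
                        double fx[N], double fy[N], double fz[N])
    {
        for (int i = 0; i < N; i++) {
            fx[i] = fy[i] = fz[i] = 0.0;
            for (int j = 0; j < N; j++) {
                if (j == i) continue;              /* skip self-interaction */
                double dx = p[j].x - p[i].x;
                double dy = p[j].y - p[i].y;
                double dz = p[j].z - p[i].z;
                double r2 = dx*dx + dy*dy + dz*dz + SOFTENING;
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                double s = G * p[i].mass * p[j].mass * inv_r3;
                fx[i] += s * dx;                   /* f(i) += f(i,j) */
                fy[i] += s * dy;
                fz[i] += s * dz;
            }
        }
    }

The two nested loops over n particles are exactly where the O(n²) cost comes from, which is what tree methods such as Barnes-Hut (next slide) reduce to O(n log n).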
11. Barnes-Hut Tree
- The Barnes-Hut algorithm is divided into 3 steps (an octree-cell sketch follows below)
  - Building the tree: O(n log n)
  - Computing cell centers of mass: O(n)
  - Computing forces: O(n log n)
- Other popular methods are
- Fast Multipole Method
- Particle Mesh Method
- TreePM Method
- Symplectic Methods
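As referenced above, a minimal sketch of an octree cell structure and the opening-angle test at the heart of Barnes-Hut (the field names and the θ value are illustrative assumptions, not Gadget-2's actual data structures):

    /* A Barnes-Hut octree cell: a cubic region of space that either holds
       a single particle (leaf) or up to 8 child cells. */
    typedef struct Cell {
        double       cm[3];     /* center of mass of everything below     */
        double       mass;      /* total mass below this cell             */
        double       size;      /* side length of the cubic region        */
        struct Cell *child[8];  /* NULL where an octant is empty          */
        int          is_leaf;
    } Cell;

    #define THETA 0.5  /* opening angle: smaller is more accurate, slower */

    /* During the force walk, a cell far enough from position pos is
       treated as a single pseudo-particle at (cm, mass) instead of being
       opened; this pruning is what cuts the cost to O(n log n).
       "Far enough" means size / distance < THETA, compared here in
       squared form to avoid a square root. */
    static int far_enough(const Cell *c, const double pos[3])
    {
        double dx = c->cm[0] - pos[0];
        double dy = c->cm[1] - pos[1];
        double dz = c->cm[2] - pos[2];
        return c->size * c->size < THETA * THETA * (dx*dx + dy*dy + dz*dz);
    }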
12. Sample Application: Gadget-2
- Cosmological simulation code
- Simulates a system of n bodies
- Implements the Barnes-Hut algorithm
- Written in C and parallelized with MPI
- As part of this project, we
  - Understood the Gadget-2 code
  - Studied how it is used in production mode
  - Modified the C code to use threads in the Barnes-Hut tree algorithm
  - Added performance counters to the code for measuring cache utilization
13. Presentation Outline
- Introduction
- Design and Implementation
- Performance Evaluation
- Conclusions and Future Work
14. Gadget-2 Architecture
15. Code Analysis
Original Code:

    for (i = 0 to No. of particles, n = 0 to BufferSize)
        calculate_force(i)
        for (j = 0 to No. of tasks)
            export_particles(j)

Modified Code:

    parallel for (i = 0 to n)
        calculate_force(i)
    for (i = 0 to No. of particles, n = 0 to BufferSize)
        for (j = 0 to No. of tasks)
            export_particles(j)
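A minimal sketch of how the hoisted force loop could look with OpenMP inside each MPI process (the function name, particle count, and chunk size here are placeholders, not Gadget-2's actual identifiers or settings):

    #include <omp.h>

    void calculate_force(int i);   /* per-particle tree walk (placeholder) */

    void force_phase(int num_local_particles)
    {
        /* Each particle's force calculation is independent, so the loop is
           shared among the OpenMP threads of this MPI process.  Dynamic
           scheduling helps because tree walks for different particles take
           different amounts of time. */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < num_local_particles; i++)
            calculate_force(i);

        /* The export of particles to other MPI tasks remains serial, so all
           MPI communication stays outside the threaded region. */
    }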
16. Presentation Outline
- Introduction
- Design and Implementation
- Performance Evaluation
- Conclusions and Future Work
17. Evaluation Testbed
- Our cluster, called Chenab, consists of nine nodes
- Each node consists of
  - An Intel Xeon quad-core (Kentsfield) processor
  - 2.4 GHz with a 1066 MHz FSB
  - 4 MB L2 cache per pair of cores
  - 32 KB L1 cache per core
  - 2 GB main memory
18. Performance Evaluation
- Performance evaluation is based on two main parameters
  - Execution time
    - Calculated directly from MPI wall-clock timings
  - Cache utilization
    - We patched the Linux kernel with the perfctr patch
    - We selected the Performance API (PAPI) for hardware performance counting
    - Used PAPI_L2_TCM (Total Cache Misses) and PAPI_L2_TCA (Total Cache Accesses) to calculate the cache miss ratio (see the sketch after this list)
- Results are shown on the upcoming slides
  - Execution Time for Colliding Galaxies
  - Execution Time for Cluster Formation
  - Execution Time for Custom Simulation
  - Cache Utilization for Cluster Formation
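As referenced above, a minimal sketch of reading these two counters with PAPI's classic high-level counter interface (the measured region and error handling are placeholders; only the two event names come from the slide):

    #include <stdio.h>
    #include <papi.h>

    static int       events[2] = { PAPI_L2_TCA, PAPI_L2_TCM };
    static long long counts[2];

    /* Placeholder for the instrumented region, e.g. the Gadget-2 force walk. */
    static void measured_region(void) { }

    int main(void)
    {
        if (PAPI_start_counters(events, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_start_counters failed\n");
            return 1;
        }

        measured_region();

        if (PAPI_stop_counters(counts, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_stop_counters failed\n");
            return 1;
        }

        /* counts[0] = total L2 cache accesses, counts[1] = total L2 misses */
        printf("L2 cache miss ratio: %.3f\n",
               (double)counts[1] / (double)counts[0]);
        return 0;
    }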
19. Execution Time for Colliding Galaxies
20. Execution Time for Cluster Formation
21. Execution Time for Custom Simulation
22. Cache Utilization for Cluster Formation
Cache utilization has been measured using the hardware counters provided by the perfctr kernel patch and the Performance API (PAPI)
23. Presentation Outline
- Introduction
- Design and Implementation
- Performance Evaluation
- Conclusions and Future Work
24. Conclusion
- We optimized Gadget-2, our sample application
- The MPI+threads approach performs better than Pure MPI
- The optimized code offers scalable performance
- We are witnessing dramatic changes in core designs for multi-core systems
  - Heterogeneous and homogeneous designs
- Targeting a 1000-core processor will require scalable frameworks and tools for programming
25. Conclusion
- Towards many-core computing
  - Multicore: 2x cores every 2 years ⇒ 64 cores in 8 years (i.e. four doublings; from a quad-core chip, 4 × 2^4 = 64)
  - Manycore: 8x to 16x multicore
Source: Dave Patterson, Overview of the Parallel Laboratory
26. Future Work
- Scalable frameworks that provide programmer-friendly high-level constructs are very important
- PeakStream provides GPU and hybrid CPU/GPU programming
- Cilk augments the C compiler with three new keywords (cilk_for, cilk_sync, cilk_spawn)
- The Research Accelerator for Multiple Processors (RAMP) can be used to simulate a 1000-core processor
- Gadget-2 can be ported to GPUs using Nvidia's CUDA framework
- The xlc compiler can be used to program the STI Cell processor
28. The Timeline
32. Barnes-Hut Tree