Title: Optimizing N-body Simulations for Multi-core Compute Clusters
1. Optimizing N-body Simulations for Multi-core Compute Clusters
2. Presentation Outline
- Introduction
- Design and Implementation
- Performance Evaluation
- Conclusions and Future Work
3. Introduction
- Sea change in basic computer architecture
  - Power Consumption
  - Heat Dissipation
- Emergence of multiple energy-efficient processing cores instead of a single power-hungry core
- Moore's law will now be realized by increasing core counts instead of increasing clock speeds
- Impact on software applications
  - Change of focus from Instruction Level Parallelism (higher clock frequency) to Thread Level Parallelism (increasing core count)
- Huge impact on the High Performance Computing (HPC) community
  - 70% of the TOP500 supercomputers are based on multi-core processors
4. Source: Google Images
5. Source: www.intel.com
6. SMP vs. Multi-core
- Symmetric Multi-Processor
- Multi-core Processor
7. HPC and Multi-core
- Message Passing Interface (MPI) is the de facto standard for programming today's supercomputers
- Alternatives include OpenMP (for SMP machines) and Unified Parallel C (UPC)
- With the existing approaches, it is possible to port MPI applications to multi-core processors
- One MPI process per core: we call this the Pure MPI approach
- OpenMP threads inside each MPI process: we call this the MPI+threads approach
- We expect the MPI+threads approach to perform better because
  - Communication between threads is cheaper than communication between processes
  - Threads are light-weight
- We have evaluated this hypothesis by comparing both approaches
8. Pure MPI vs. MPI+threads Approach
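For reference, a minimal C sketch of how the MPI+threads configuration is typically structured (illustrative only: the thread-support level, what work goes inside the parallel region, and the launch settings are assumptions, not the exact configuration evaluated here). Pure MPI would run the same program with one rank per core and a single thread per rank.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    /* MPI+threads skeleton: one MPI rank per node (or per socket), with
       OpenMP threads sharing the compute work on that node's cores. */
    int main(int argc, char *argv[])
    {
        int provided, rank;

        /* FUNNELED: only the main thread of each rank makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* Compute-only work (e.g. per-particle force calculation)
               is divided among the threads of this rank. */
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        /* Communication (particle exchange, reductions) happens here,
           outside the threaded region, on the main thread. */
        MPI_Finalize();
        return 0;
    }

Launching with one rank per node and OMP_NUM_THREADS set to the number of cores per node gives the MPI+threads configuration; launching with one rank per core and OMP_NUM_THREADS=1 reduces it to Pure MPI.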
9. Sample Application: N-body Simulations
- To demonstrate the usefulness of our MPI+threads approach, we chose an N-body simulation code
- The N-body (many-body) method is used for simulating the evolution of a system consisting of n bodies
- It has found widespread use in the fields of
- Astrophysics
- Molecular Dynamics
- Computational Biology
10. Summation Approach to solving N-body problems
The most compute-intensive part of any N-body method is the force calculation phase
- The simplest expression for a far-field force f(i) on particle i is

    for i = 1 to n
        f(i) = Σ_{j = 1,...,n, j ≠ i} f(i,j)
    end for

- where f(i,j) is the force on particle i due to particle j
The cost of this calculation is O(n²)
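As an illustration, a minimal C sketch of this direct-summation loop for a gravitational far-field force (the struct, constants, and function name are placeholders for illustration, not taken from any particular code):

    #include <math.h>

    #define N          1024
    #define G          6.674e-11   /* gravitational constant             */
    #define SOFTENING  1e-9        /* keeps r from reaching exactly zero */

    typedef struct { double x, y, z, mass; } Particle;

    /* Direct O(n^2) summation: for every particle i, accumulate the
       force contributed by every other particle j. */
    void compute_forces(const Particle p[N],
                        double fx[N], double fy[N], double fz[N])
    {
        for (int i = 0; i < N; i++) {
            fx[i] = fy[i] = fz[i] = 0.0;
            for (int j = 0; j < N; j++) {
                if (j == i) continue;              /* skip self-interaction */
                double dx = p[j].x - p[i].x;
                double dy = p[j].y - p[i].y;
                double dz = p[j].z - p[i].z;
                double r2 = dx*dx + dy*dy + dz*dz + SOFTENING;
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                double s = G * p[i].mass * p[j].mass * inv_r3;
                fx[i] += s * dx;                   /* f(i) += f(i,j) */
                fy[i] += s * dy;
                fz[i] += s * dz;
            }
        }
    }

The two nested loops over n particles are exactly where the O(n²) cost comes from, which is what tree methods such as Barnes-Hut (next slide) reduce to O(n log n).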
11. Barnes-Hut Tree
- The Barnes-Hut algorithm is divided into 3 steps (an octree-cell sketch follows below)
  - Building the tree: O(n log n)
  - Computing cell centers of mass: O(n)
  - Computing forces: O(n log n)
- Other popular methods are
- Fast Multipole Method
- Particle Mesh Method
- TreePM Method
- Symplectic Methods
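As referenced above, a minimal sketch of an octree cell structure and the opening-angle test at the heart of Barnes-Hut (the field names and the θ value are illustrative assumptions, not Gadget-2's actual data structures):

    /* A Barnes-Hut octree cell: a cubic region of space that either holds
       a single particle (leaf) or up to 8 child cells. */
    typedef struct Cell {
        double       cm[3];     /* center of mass of everything below     */
        double       mass;      /* total mass below this cell             */
        double       size;      /* side length of the cubic region        */
        struct Cell *child[8];  /* NULL where an octant is empty          */
        int          is_leaf;
    } Cell;

    #define THETA 0.5  /* opening angle: smaller is more accurate, slower */

    /* During the force walk, a cell far enough from position pos is
       treated as a single pseudo-particle at (cm, mass) instead of being
       opened; this pruning is what cuts the cost to O(n log n).
       "Far enough" means size / distance < THETA, compared here in
       squared form to avoid a square root. */
    static int far_enough(const Cell *c, const double pos[3])
    {
        double dx = c->cm[0] - pos[0];
        double dy = c->cm[1] - pos[1];
        double dz = c->cm[2] - pos[2];
        return c->size * c->size < THETA * THETA * (dx*dx + dy*dy + dz*dz);
    }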
12. Sample Application: Gadget-2
- Cosmological simulation code
- Simulates a system of n bodies
- Implements the Barnes-Hut algorithm
- Written in C and parallelized with MPI
- As part of this project, we
  - Understood the Gadget-2 code
  - Studied how it is used in production mode
  - Modified the C code to use threads in the Barnes-Hut tree algorithm
  - Added performance counters to the code for measuring cache utilization
13. Presentation Outline
- Introduction
- Design and Implementation
- Performance Evaluation
- Conclusions and Future Work
14. Gadget-2 Architecture
15. Code Analysis
Original Code:

    for (i = 0 to No. of particles, n = 0 to BufferSize)
        calculate_force(i)
        for (j = 0 to No. of tasks)
            export_particles(j)

Modified Code:

    parallel for (i = 0 to n)
        calculate_force(i)
    for (i = 0 to No. of particles, n = 0 to BufferSize)
        for (j = 0 to No. of tasks)
            export_particles(j)
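A minimal sketch of how the hoisted force loop could look with OpenMP inside each MPI process (the function name, particle count, and chunk size here are placeholders, not Gadget-2's actual identifiers or settings):

    #include <omp.h>

    void calculate_force(int i);   /* per-particle tree walk (placeholder) */

    void force_phase(int num_local_particles)
    {
        /* Each particle's force calculation is independent, so the loop is
           shared among the OpenMP threads of this MPI process.  Dynamic
           scheduling helps because tree walks for different particles take
           different amounts of time. */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < num_local_particles; i++)
            calculate_force(i);

        /* The export of particles to other MPI tasks remains serial, so all
           MPI communication stays outside the threaded region. */
    }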
16. Presentation Outline
- Introduction
- Design and Implementation
- Performance Evaluation
- Conclusions and Future Work
17. Evaluation Testbed
- Our cluster, called Chenab, consists of nine nodes
- Each node consists of
  - An Intel Xeon quad-core (Kentsfield) processor
  - 2.4 GHz with a 1066 MHz FSB
  - 4 MB L2 cache per pair of cores
  - 32 KB L1 cache per core
  - 2 GB main memory
18. Performance Evaluation
- Performance evaluation is based on two main parameters
  - Execution time
    - Calculated directly from MPI wall-clock timings
  - Cache utilization
    - We patched the Linux kernel with the perfctr patch
    - We selected the Performance API (PAPI) for hardware performance counting
    - Used PAPI_L2_TCM (Total Cache Misses) and PAPI_L2_TCA (Total Cache Accesses) to calculate the cache miss ratio (see the sketch after this list)
- Results are shown on the upcoming slides
  - Execution Time for Colliding Galaxies
  - Execution Time for Cluster Formation
  - Execution Time for Custom Simulation
  - Cache Utilization for Cluster Formation
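As referenced above, a minimal sketch of reading these two counters with PAPI's classic high-level counter interface (the measured region and error handling are placeholders; only the two event names come from the slide):

    #include <stdio.h>
    #include <papi.h>

    static int       events[2] = { PAPI_L2_TCA, PAPI_L2_TCM };
    static long long counts[2];

    /* Placeholder for the instrumented region, e.g. the Gadget-2 force walk. */
    static void measured_region(void) { }

    int main(void)
    {
        if (PAPI_start_counters(events, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_start_counters failed\n");
            return 1;
        }

        measured_region();

        if (PAPI_stop_counters(counts, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_stop_counters failed\n");
            return 1;
        }

        /* counts[0] = total L2 cache accesses, counts[1] = total L2 misses */
        printf("L2 cache miss ratio: %.3f\n",
               (double)counts[1] / (double)counts[0]);
        return 0;
    }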
19. Execution Time for Colliding Galaxies
20. Execution Time for Cluster Formation
21. Execution Time for Custom Simulation
22. Cache Utilization for Cluster Formation
Cache utilization has been measured using the hardware counters provided by the perfctr kernel patch and the Performance API (PAPI)
23. Presentation Outline
- Introduction
- Design and Implementation
- Performance Evaluation
- Conclusions and Future Work
24. Conclusion
- We optimized Gadget-2, our sample application
- The MPI+threads approach performs better than Pure MPI
- The optimized code offers scalable performance
- We are witnessing dramatic changes in core designs for multi-core systems
  - Heterogeneous and homogeneous designs
- Targeting a 1000-core processor will require scalable frameworks and tools for programming
25. Conclusion
- Towards many-core computing
  - Multicore: 2x cores every 2 years ⇒ 64 cores in 8 years (i.e. four doublings; from a quad-core chip, 4 × 2^4 = 64)
  - Manycore: 8x to 16x multicore
Source: Dave Patterson, Overview of the Parallel Laboratory
26. Future Work
- Scalable frameworks that provide programmer-friendly high-level constructs are very important
- PeakStream provides GPU and hybrid CPU/GPU programming
- Cilk augments the C compiler with three new keywords (cilk_for, cilk_sync, cilk_spawn)
- The Research Accelerator for Multiple Processors (RAMP) can be used to simulate a 1000-core processor
- Gadget-2 can be ported to GPUs using Nvidia's CUDA framework
- The xlc compiler can be used to program the STI Cell processor
28. The Timeline
32. Barnes-Hut Tree