Title: High Performance Computing
1 High Performance Computing
- Kai Wang
- Department of Computer Science
- University of South Dakota
- http://www.usd.edu/Kai.Wang
2 Goal of the Talk
- In the previous lesson, we learned some basic knowledge of high performance computing
- What is high performance computing
- Hardware and software
- A simple parallel code
- In this talk, you will get to know some research interests in high performance computing
- Algorithm design
- Language
3 High Performance Computing II
- Kai Wang
- Department of Computer Science
- University of South Dakota
- http://www.usd.edu/Kai.Wang
4 Table of Contents
- Research in high performance computing
- Parallel algorithm of sparse approximate inverse
- Parallel language design
5 Research in High Performance Computing
- Hardware
- Design high speed CPU, memory
- Build high performance computer
- Software
- Theory
- Mechanism
- Language design
- Resource management
- Network connection, graph theory
- Input/output
- Application
- Given an application, design its algorithm on high performance computers
- Tools, libraries
- General purpose use, ACTS (Advanced CompuTational Software)
6 Research in Algorithm Design
- High performance computing vs. parallel computing
- High performance computing runs on high performance computers
- High performance computing is more concerned with performance
- Code efficiency
- Scalability (speedup and efficiency are defined below)
- Therefore, we focus on parallel algorithm design
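As a reference for the two performance measures named above (standard definitions, not taken from the slides):

\[ S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} \]

where T(p) is the running time on p processors; an algorithm is considered scalable when the efficiency E(p) stays close to 1 as p grows.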
7 Example of Algorithm Design
- A very popular question
- Ax = b
8 Example of Algorithm Design
- Various scientific modeling and simulation problems are modeled by partial differential equations
- The equations are discretized
- We get Ax = b
- It needs to be solved
9 Why Do High Performance Computing
- The matrix A is usually large: millions of unknowns
- The problem is large-scale
- People are trying to get accurate simulation and modeling results, which requires the discretization of the PDE to be detailed enough
- The solution of x costs a huge amount of CPU time and storage
- Gaussian elimination
- O(n²) memory cost, O(n³) computational cost (a rough estimate follows)
- No one wants to wait months
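A rough, illustrative estimate (my own numbers: a dense matrix, 8-byte entries, and a machine sustaining about 10^12 floating-point operations per second):

\[ n = 10^6:\quad n^2 \times 8\ \text{bytes} = 8\ \text{TB of memory}, \qquad n^3 = 10^{18}\ \text{operations} \approx 10^6\ \text{s} \approx 12\ \text{days} \]

Gaussian elimination actually needs about (2/3)n³ operations, but the order of magnitude is the same.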
10 No Parallelism
- So we need to solve
- Ax = b
- on high performance computers
- The most robust algorithm is Gaussian elimination
- There is no parallelism in the algorithm
- It is difficult to implement on parallel computers
11 Why No Parallelism
12 Why No Parallelism
13 Why No Parallelism
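A minimal C sketch of forward elimination (my illustration, not from the slides) makes the difficulty concrete: the outer loop over elimination steps is inherently sequential, because step k needs the matrix entries updated by step k-1.

    /* Forward elimination in Gaussian elimination (dense, no pivoting). */
    void forward_eliminate(int n, double a[n][n], double b[n])
    {
        for (int k = 0; k < n; k++) {            /* elimination steps must run in order */
            for (int i = k + 1; i < n; i++) {    /* only the row updates inside one step are independent */
                double factor = a[i][k] / a[k][k];
                for (int j = k; j < n; j++)
                    a[i][j] -= factor * a[k][j];
                b[i] -= factor * b[k];
            }
        }
    }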
14 Iterative Methods
- Matrix-vector products (a sketch follows)
- Parallelism is not a problem now
- Jacobi, Gauss-Seidel
- Not robust, not efficient
- Multigrid, Krylov subspace methods (CG, GMRES)
- Not robust enough
- Robustness is a problem
- Preconditioned Krylov methods
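For contrast with Gaussian elimination, here is a minimal C sketch (my illustration; the CSR storage format is my assumption) of the sparse matrix-vector product that these iterative methods are built on. Every row is computed independently, so rows can simply be distributed across processors.

    /* Sparse matrix-vector product y = A*x, with A stored in CSR format. */
    void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {            /* rows are independent: trivially parallel */
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }

In a distributed setting each processor owns a block of rows, and only the entries of x corresponding to off-processor columns need to be communicated.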
15 Preconditioning
- Ax = b
- is difficult to solve by iterative methods because A is ill-conditioned
- Transform it to MAx = Mb
- M is called the preconditioner
- This process is called preconditioning
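In other words (a standard way to state it, not a quote from the slide): M is chosen as a cheap approximation to the inverse of A, so that

\[ M \approx A^{-1} \;\Rightarrow\; MA \approx I, \]

and the preconditioned matrix MA is much better conditioned than A, which lets the iterative method converge in far fewer iterations.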
16 Parallel Preconditioning
- There are at least 50 different ways to compute M
- Not all of them are suitable for high performance computers
17 Localized Parallel Techniques
- Only compute the diagonal part
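One concrete example of such a localized technique is a purely diagonal (Jacobi-type) preconditioner; the C sketch below is my illustration of the idea, not necessarily the exact method on the slide. Every entry of M depends only on the local row, so building and applying it needs no communication.

    /* Build M = diag(A)^(-1) from a CSR matrix. */
    void build_diag_precond(int n, const int *row_ptr, const int *col_idx,
                            const double *val, double *m_inv)
    {
        for (int i = 0; i < n; i++)
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                if (col_idx[k] == i)             /* pick out the diagonal entry a_ii */
                    m_inv[i] = 1.0 / val[k];
    }

    /* Applying M is a pointwise scaling, z = M*r, used inside the Krylov iteration. */
    void apply_diag_precond(int n, const double *m_inv, const double *r, double *z)
    {
        for (int i = 0; i < n; i++)
            z[i] = m_inv[i] * r[i];
    }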
18 Sparse Approximate Inverse
19 Why Have Parallelism
- The problem can be transformed (one common formulation is sketched below)
- The degree of parallelism is n
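One common formulation of the sparse approximate inverse (the formula itself is not in this transcript, so this is the standard Frobenius-norm version): M is computed by minimizing

\[ \min_{M} \| AM - I \|_F^2 \;=\; \sum_{j=1}^{n} \min_{m_j} \| A m_j - e_j \|_2^2, \]

so each column m_j of M is an independent small least-squares problem, and all n columns can be computed in parallel; this is where the degree of parallelism n comes from.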
20 Research in Language Design
- Running a program on a high performance computer needs
- Hardware
- High performance computer
- Available number of processors
- Software
- Parallel environment support
- Resource management
- A language supporting parallel programs
- Parallel programs written in the language
21 Message Passing Interface
- MPI is the most popular parallel programming model
- More than 90% of parallel programs use it
- but it is not the best
- Given P processors
- It requires the programmer to decompose the computation into exactly P pieces
- Each piece is assigned to 1 processor
- mpirun -np P programname
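A minimal MPI example in C (my illustration; the summation is just a placeholder workload) showing the point: the programmer splits the work into exactly as many pieces as there are processes.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int N = 1000000;                   /* total amount of work */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;                    /* decomposition fixed by the number of processes */
        int begin = rank * chunk;
        int end   = (rank == size - 1) ? N : begin + chunk;

        double local = 0.0, total = 0.0;
        for (int i = begin; i < end; i++)        /* each process handles exactly one piece */
            local += 1.0 / (i + 1.0);

        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %f\n", total);

        MPI_Finalize();
        return 0;
    }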
22 Why Not Good
- Software engineering
- The problem is decomposed not according to its nature, but to the number of physical processors
- Load balancing
- The programmer has to control the decomposition carefully
- Additional coding
- Especially for dynamic applications
23 Processor Virtualization
- Get rid of the limitation of physical processors
- Give the runtime system the power to control
- The decomposition is based on the nature of the problem
- The problem is divided into m pieces
- The runtime system maps these m pieces to p physical processors
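As a toy illustration of the mapping step (my sketch only; a real runtime such as Charm++ remaps pieces dynamically for load balance), a static assignment of m virtual pieces to p physical processors could be as simple as:

    /* Map virtual piece i (0 <= i < m) to one of p physical processors. */
    int home_processor(int piece, int p)
    {
        return piece % p;                        /* round-robin placement; the runtime may migrate pieces later */
    }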
24 Why Is It Good
- Cache performance
- Each piece is smaller and fits into the cache more easily
- This can improve cache performance even when the code is already cache-optimized
25 Why Is It Good
- Overlapping of communication and computation
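In plain MPI this overlap has to be coded by hand with nonblocking calls; with virtualization the runtime gets a similar effect by scheduling another piece while one piece waits for messages. A minimal C sketch of the manual MPI pattern (the neighbors, tag, and buffers are hypothetical):

    #include <mpi.h>

    void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);  /* start communication */
        MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        /* ... compute on interior data that does not need recvbuf ... */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);                    /* finish communication */
        /* ... compute on the boundary data that needed recvbuf ... */
    }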
27 Implementation
- Charm++
- C++ language
- Each piece of computation is represented by a chare object
- Message-driven communication
28 Implementation
- AMPI
- Same as MPI, but supports virtualization
- Has automatic load balancing support
- Message passing communication
- ampi np p vp m programname
29 NAMD: A Production MD Program
- NAMD
- Fully featured program
- NIH-funded development
- Distributed free of charge (5000 downloads so far)
- Binaries and source code
- Installed at NSF centers
30 NAMD on Lemieux without PME
ATPase: 327,000 atoms, including water
31 Conclusion
- Information and knowledge are growing exponentially
- People need ever-increasing computing power
- High performance computing will find more and more applications and become the next generation computing technique