Title: Overview of Extreme-Scale Software Research in China
1. Overview of Extreme-Scale Software Research in China
- Depei Qian
- Sino-German Joint Software Institute (JSI)
- Beihang University
- China-USA Computer Software Workshop
- Sep. 27, 2011
2. Outline
- Related R&D efforts in China
- Algorithms and computational methods
- HPC and e-Infrastructure
- Parallel programming frameworks
- Programming heterogeneous systems
- Advanced compiler technology
- Tools
- Domain-specific programming support
3. Related R&D efforts in China
- NSFC
  - Basic algorithms and computable modeling for high performance scientific computing
  - Network-based research environment
  - Many-core parallel programming
- 863 program
  - High productivity computer and Grid service environment
  - Multicore/many-core programming support
  - HPC software for earth system modeling
- 973 program
  - Parallel algorithms for large-scale scientific computing
  - Virtual computing environment
4. Algorithms and Computational Methods
5. NSFC's Key Initiative on Algorithms and Modeling
- Basic algorithms and computable modeling for high performance scientific computing
- 8-year program, launched in 2011
- 180 million Yuan funding
- Focused on
  - Novel computational methods and basic parallel algorithms
  - Computable modeling for selected domains
  - Implementation and verification of parallel algorithms by simulation
6. (No transcript)
7. 863's key projects on HPC and Grid
- High Productivity Computer and Grid Service Environment
  - Period: 2006-2010
  - 940 million Yuan from MOST and more than 1 billion Yuan in matching funds from other sources
- Major R&D activities
  - Developing PFlops computers
  - Building up a grid service environment: CNGrid
  - Developing Grid and HPC applications in selected areas
8. CNGrid GOS Architecture
9. Abstractions
- Grid community: Agora
  - persistent information storage and organization
- Grid process: Grip
  - runtime control
10. CNGrid GOS deployment
- CNGrid GOS deployed on 11 sites and some application Grids
- Supports heterogeneous HPCs: Galaxy, Dawning, DeepComp
- Supports multiple platforms: Unix, Linux, Windows
- Uses public network connections, with only the HTTP port enabled
- Flexible clients
  - Web browser
  - Special client
  - GSML client
11. CNGrid sites
- Tsinghua University: 1.33 TFlops, 158 TB storage, 29 applications, 100 users, IPv4/v6 access
- CNIC: 150 TFlops, 1.4 PB storage, 30 applications, 269 users all over the country, IPv4/v6 access
- IAPCM: 1 TFlops, 4.9 TB storage, 10 applications, 138 users, IPv4/v6 access
- Shandong University: 10 TFlops, 18 TB storage, 7 applications, 60 users, IPv4/v6 access
- GSCC: 40 TFlops, 40 TB storage, 6 applications, 45 users, IPv4/v6 access
- SSC: 200 TFlops, 600 TB storage, 15 applications, 286 users, IPv4/v6 access
- XJTU: 4 TFlops, 25 TB storage, 14 applications, 120 users, IPv4/v6 access
- USTC: 1 TFlops, 15 TB storage, 18 applications, 60 users, IPv4/v6 access
- HUST: 1.7 TFlops, 15 TB storage, IPv4/v6 access
- SIAT: 10 TFlops, 17.6 TB storage, IPv4/v6 access
- HKU: 20 TFlops, 80 users, IPv4/v6 access
12. CNGrid resources
- 11 sites
- >450 TFlops
- 2900 TB storage
- Three PF-scale sites will be integrated into CNGrid soon
13. CNGrid: services and users
- 230 services
- >1400 users, including:
  - China Commercial Aircraft Corp.
  - Bao Steel
  - automobile industry
  - institutes of CAS
  - universities
14. CNGrid: applications
- Supporting >700 projects
- 973, 863, NSFC, CAS Innovative, and Engineering projects
15. Parallel programming frameworks
16. JASMIN: a parallel programming framework
- [Framework diagram: a library layer separates the models, stencils, and algorithms (special and common parts) from the underlying computers]
- Also supported by the 973 and 863 projects
17. Basic ideas
- Hide the complexity of programming millions of cores
- Integrate efficient implementations of parallel fast numerical algorithms
- Provide efficient data structures and solver libraries
- Support software engineering for code extensibility
18. Basic ideas (cont.)
- [Diagram: scale up from serial programming on a personal computer to a TeraFlops cluster and on to a PetaFlops MPP by using the JASMIN infrastructures]
19. JASMIN
- [Application areas: inertial confinement fusion, global climate modeling, CFD, and material simulations, on structured grids, unstructured grids, and particle simulations]
20. JASMIN (cont.)
- User provides: physics, parameters, numerical methods, expert experience, special algorithms, etc.
- User interfaces: component-based parallel programming models (C++ classes)
- Numerical algorithms: geometry, fast solvers, mature numerical methods, time integrators, etc.
- HPC implementations (thousands of CPUs): data structures, parallelization, load balancing, adaptivity, visualization, restart, memory, etc.
- Architecture: multilayered, modularized, object-oriented
- Codes: C++/C/F90/F77 with MPI/OpenMP, about 500,000 lines
- Installation: personal computers, clusters, MPP
21. Numerical simulations on TianHe-1A

Code      CPU cores     Code                CPU cores
LARED-S   32,768        RH2D                1,024
LARED-P   72,000        HIME3D              3,600
LAP3D     16,384        PDD3D               4,096
MEPH3D    38,400        LARED-R             512
MD3D      80,000        LARED Integration   128
RT3D      1,000

- Simulation duration: several hours to tens of hours.
22. Programming heterogeneous systems
23. GPU programming support
- Source-to-source translation
- Runtime optimization
- Mixed programming model for multi-GPU systems
24. S2S translation for GPU
- A source-to-source translator, GPU-S2S, for GPU programming
- Facilitates the development of parallel programs on GPUs by combining automatic mapping and static compilation
25. S2S translation for GPU (cont.)
- Insert directives into the source program
  - Guide implicit calls of the CUDA runtime libraries
  - Enable the user to control the mapping from the homogeneous CPU platform to the GPU's streaming platform
- Optimization based on runtime profiling
  - Take full advantage of the GPU according to application characteristics by collecting runtime dynamic information (a hypothetical directive-to-CUDA sketch follows below)
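The slides do not show the actual GPU-S2S directive syntax, so the directive below is a hypothetical placeholder; the sketch only illustrates the general idea of directive-guided source-to-source translation, i.e., what kind of CUDA code a translator could emit for an annotated loop.

```cuda
// Hypothetical input the user might write (directive name is an assumption):
//
//     #pragma gpus2s kernel map(to: a[0:n], b[0:n]) map(from: c[0:n])
//     for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
//
// A translator could turn it into CUDA code of roughly this shape:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_add_kernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a = new float[n], *b = new float[n], *c = new float[n];
    for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    // "map(to:...)" becomes explicit allocation plus host-to-device copies.
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    // The annotated loop becomes a kernel launch.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vec_add_kernel<<<blocks, threads>>>(da, db, dc, n);

    // "map(from:...)" becomes a device-to-host copy after the kernel.
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[12345] = %f\n", c[12345]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] a; delete[] b; delete[] c;
    return 0;
}
```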
26. The GPU-S2S architecture
27. Program translation by GPU-S2S
28. Runtime optimization based on profiling
- First-level profiling (function level)
- Second-level profiling (memory access and kernel improvement)
- Third-level profiling (data partition)
29. First-level profiling
- Identify computing kernels
- Instrument and scan the source code, get the execution time of every function, and identify the computing kernels (a minimal timing sketch follows below)
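A minimal sketch of function-level instrumentation, assuming nothing about the actual GPU-S2S tooling: each candidate function is wrapped with a host timer so the dominant function can be picked as the computing kernel.

```cuda
#include <cstdio>
#include <chrono>

// Run a function once and report its wall-clock time.
static double timed(const char *name, void (*fn)(void)) {
    auto t0 = std::chrono::high_resolution_clock::now();
    fn();                                              // candidate function
    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("%-12s %.3f ms\n", name, ms);               // per-function profile record
    return ms;
}

static void setup()   { volatile double s = 0; for (int i = 0; i < 1000;     i++) s += i; }
static void compute() { volatile double s = 0; for (int i = 0; i < 50000000; i++) s += i; }

int main() {
    double t_setup   = timed("setup",   setup);
    double t_compute = timed("compute", compute);
    // The function dominating total time is selected as the computing kernel.
    printf("computing kernel: %s\n", t_compute > t_setup ? "compute" : "setup");
    return 0;
}
```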
30. Second-level profiling
- Identify the memory access pattern and improve the kernels
- Instrument the computing kernels
- Extract and analyze the profile information, optimize according to the characteristics of the application, and finally generate CUDA code with optimized kernels
31. Third-level profiling
- Optimize by improving the data partitioning
- Get copy time and computing time by instrumentation
- Compute the number of streams and the data size of each stream
- Generate the optimized CUDA code with streams (a copy/compute overlap sketch follows below)
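A sketch of the kind of stream-partitioned code this level could generate. The chunk count here is fixed rather than derived from measured copy/compute times as GPU-S2S would do; that choice is an assumption for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int n = 1 << 22, NSTREAMS = 4, chunk = n / NSTREAMS;
    float *h, *d;
    cudaMallocHost((void **)&h, n * sizeof(float));   // pinned memory enables async copies
    cudaMalloc((void **)&d, n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    cudaStream_t s[NSTREAMS];
    for (int k = 0; k < NSTREAMS; k++) cudaStreamCreate(&s[k]);

    // Each chunk gets its own stream so copy-in, kernel, and copy-out overlap
    // across chunks instead of running as one serial copy/compute/copy chain.
    for (int k = 0; k < NSTREAMS; k++) {
        int off = k * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);             // expect 2.0

    for (int k = 0; k < NSTREAMS; k++) cudaStreamDestroy(s[k]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```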
32. Matrix multiplication: performance comparison before and after profiling
- The CUDA code with three-level profiling optimization achieves a 31% improvement over the CUDA code with only memory access optimization, and a 91% improvement over the CUDA code using only global memory for computing.
- [Figure: execution performance comparison on different platforms]
33. FFT (1,048,576 points): performance comparison before and after profiling
- The CUDA code with three-level profiling optimization achieves a 38% improvement over the CUDA code with memory access optimization, and a 77% improvement over the CUDA code using only global memory for computing.
- [Figure: FFT (1,048,576 points) execution performance comparison on different platforms]
34. Programming multi-GPU systems
- The memory of a CPU+GPU system is both distributed and shared, so it is feasible to use MPI and PGAS programming models for this new kind of system.
- MPI or PGAS: use message passing or shared data for communication between parallel tasks or GPUs
35. Mixed programming model
- [Diagram: the NVIDIA GPU (CUDA) model combined with traditional programming models (MPI/UPC) gives MPI+CUDA and UPC+CUDA; CUDA program execution flow]
36. MPI+CUDA experiment
- Platform
  - 2 NF5588 servers, each equipped with:
    - 1 Xeon CPU (2.27 GHz), 12 GB main memory
    - 2 NVIDIA Tesla C1060 GPUs (GT200 architecture, 4 GB device memory)
  - 1 Gbit Ethernet
  - Red Hat Linux 5.3
  - CUDA Toolkit 2.3 and CUDA SDK
  - OpenMPI 1.3
  - Berkeley UPC 2.1
37. MPI+CUDA experiment (cont.)
- Matrix multiplication program
  - Uses block matrix multiplication for the UPC programming
  - Data are spread across the UPC threads
  - The computing kernel multiplies two blocks at a time and is implemented in CUDA
- Total execution time: Tsum = Tcom + Tcuda = Tcom + Tcopy + Tkernel (a measurement sketch follows below)
  - Tcom: UPC thread communication time
  - Tcuda: CUDA program execution time
  - Tcopy: data transfer time between host and device
  - Tkernel: GPU computing time
38. MPI+CUDA experiment (cont.)
- Setup: 2 servers, at most 8 MPI tasks; 1 server with 2 GPUs
- For 4094x4096 matrices, the speedup of 1 MPI+CUDA task (using 1 GPU for computing) is 184x that of the case with 8 MPI tasks
- For small data scales, such as 256 or 512, the execution time using 2 GPUs is even longer than using 1 GPU
  - The computing scale is too small; the communication between the two tasks overwhelms the reduction in computing time
39. PKU Manycore Software Research Group
- Software tool development for GPU clusters
  - Unified multicore/manycore/clustering programming
  - Resilience technology for very large GPU clusters
- Software porting service
  - Joint project, <3k lines of code, supporting Tianhe
- Advanced training program
40. PKU-Tianhe turbulence simulation
- PKUFFT (using GPUs)
  - Reaches a scale 43 times that of the Earth Simulator run
  - 7168 nodes / 14336 CPUs / 7168 GPUs
  - FFT speed 1.6x that of Jaguar
  - Proof of feasibility of GPU speedup for large-scale systems
- [Chart compares PKUFFT (using GPUs), MKL (not using GPUs), and Jaguar]
41. Advanced Compiler Technology
42. Advanced Compiler Technology (ACT) group at ICT, CAS
- ACT's current research
  - Parallel programming languages and models
  - Optimizing compilers and tools for HPC (Dawning) and multicore processors (Loongson)
- Will lead the new multicore/many-core programming support project
43. PTA: Process-based TAsk parallel programming model
- New process-based task construct
  - With properties of isolation, atomicity, and deterministic submission
  - Annotates a loop into two parts: prologue and task segment
    - #pragma pta parallel [clauses]
    - #pragma pta task
    - #pragma pta propagate(varlist)
  - Suitable for expressing coarse-grained, irregular parallelism in loops (an annotation sketch follows below)
- Implementation and performance
  - PTA compiler, runtime system, and an assistant tool (helps write correct programs)
  - Speedups of 4.62 to 43.98 (average 27.58) on 48 cores and 3.08 to 7.83 (average 6.72) on 8 cores
  - Code changes are within 10 lines, much smaller than with OpenMP
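An illustrative sketch only: the slides list the PTA pragmas but not their full clause semantics, so the placement below (what counts as prologue vs. task segment, and the propagate list) is an assumption. Unknown pragmas are ignored by ordinary compilers, so the file still builds and runs serially without the PTA compiler.

```cuda
#include <cstdio>

int main() {
    const int n = 16;
    double sum = 0.0;

    #pragma pta parallel            /* loop whose iterations become tasks */
    for (int i = 0; i < n; i++) {
        int work = i * i;           /* prologue: cheap, runs before the task */

        #pragma pta task            /* coarse-grained, isolated task segment */
        {
            double local = 0.0;
            for (int j = 0; j < 100000; j++) local += work * 1e-6;

            #pragma pta propagate(sum)   /* deterministic submission of results */
            sum += local;
        }
    }
    printf("sum = %f\n", sum);
    return 0;
}
```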
44. UPC-H: a parallel programming model for deep parallel hierarchies
- Hierarchical UPC
  - Provides multi-level data distribution
  - Implicit and explicit hierarchical loop parallelism
  - Hybrid execution model: SPMD with fork-join
  - Multi-dimensional data distribution and super-pipelining
- Implementations on CUDA clusters and the Dawning 6000 cluster
  - Based on Berkeley UPC
  - Enhanced optimizations such as localization and communication optimization
  - Supports SIMD intrinsics
  - CUDA cluster: 72% of the performance of hand-tuned versions, with code size reduced to 68%
  - Multi-core cluster: better process mapping and cache reuse than UPC
45. OpenMP and runtime support for heterogeneous platforms
- Heterogeneous platforms consisting of CPUs and GPUs
  - Multiple GPUs, or CPU-GPU cooperation, bring extra data transfers that hurt the performance gain
  - Programmers need a unified data management system
- OpenMP extension
  - Specify a partitioning ratio to optimize data transfers globally (a plain CPU/GPU partitioning sketch follows below)
  - Specify heterogeneous blocking sizes to reduce false sharing among computing devices
- Runtime support
  - DSM system based on the specified blocking size
  - Intelligent runtime prefetching with the help of compiler analysis
- Implementation and results
  - On the OpenUH compiler
  - Gains a 1.6x speedup through prefetching on NPB/SP (class C)
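This is not the project's actual OpenMP extension, whose directive syntax the slides do not show; it is only a plain sketch of the idea behind a partitioning ratio: a hypothetical ratio decides how much of an array the GPU processes while the CPU handles the rest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void inc_gpu(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    const float ratio = 0.75f;              // assumed ratio: 75% of the work on the GPU
    const int n_gpu = (int)(n * ratio);
    float *h = new float[n];
    for (int i = 0; i < n; i++) h[i] = 0.0f;

    // GPU part of the partition.
    float *d; cudaMalloc((void **)&d, n_gpu * sizeof(float));
    cudaMemcpy(d, h, n_gpu * sizeof(float), cudaMemcpyHostToDevice);
    inc_gpu<<<(n_gpu + 255) / 256, 256>>>(d, n_gpu);

    // CPU part of the partition, overlapped with the asynchronous GPU kernel.
    #pragma omp parallel for
    for (int i = n_gpu; i < n; i++) h[i] += 1.0f;

    cudaMemcpy(h, d, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0]=%f h[n-1]=%f\n", h[0], h[n - 1]);   // both partitions updated
    cudaFree(d); delete[] h;
    return 0;
}
```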
46. Analyzers based on compiling techniques for MPI programs
- Communication slicing and process mapping tool
  - Compiler part
    - PDG graph building and slice generation
    - Iteration set transformation for approximation
  - Optimized mapping tool
    - Weighted graph, hardware characteristics
    - Graph partitioning and feedback-based evaluation
- Memory bandwidth measuring tool for MPI programs
  - Detects bursts of bandwidth requirements
- Enhanced performance of MPI error checking
  - Redundant error checking removed by dynamically turning the global error checking on/off
  - With the help of compiler analysis on communicators
  - Integrated with a model checking tool (ISP) and a runtime checking tool (MARMOT)
47. LoongCC: an optimizing compiler for Loongson multicore processors
- Based on Open64-4.2, supporting C/C++/Fortran
  - Open source at http://svn.open64.net/svnroot/open64/trunk/
- Powerful optimizer and analyzer with better performance
  - SIMD intrinsic support
  - Memory locality optimization
  - Data layout optimization
  - Data prefetching
  - Load/store grouping for 128-bit memory access instructions
- Integrated with an Aggressive Auto-Parallelization Optimization (AAPO) module
  - Dynamic privatization
  - Parallel model with dynamic alias optimization
  - Array reduction optimization
48. Tools
49. Testing and evaluation of HPC systems
- A center led by Tsinghua University (Prof. Wenguang Chen)
- Developing accurate and efficient testing and evaluation tools
- Developing benchmarks for HPC evaluation
- Providing services to HPC developers and users
50. LSP3AS: large-scale parallel program performance analysis system
- Designed for performance tuning on peta-scale HPC systems
- Method
  - Source code is instrumented
  - The instrumented code is executed, generating profiling/tracing data files
  - The profiling/tracing data are analyzed and a visualization report is generated
- Instrumentation is based on TAU from the University of Oregon
51. LSP3AS: large-scale parallel program performance analysis system (cont.)
- Scalable performance data collection
  - Distributed data collection and transmission: eliminates bottlenecks in the network and in data processing
  - Dynamic compensation: reduces the influence of the performance data volume
  - Efficient data transmission: uses Remote Direct Memory Access (RDMA) to achieve high bandwidth and low latency
52. LSP3AS: large-scale parallel program performance analysis system (cont.)
- Analysis and visualization
  - Data analysis: iteration-based clustering is used
  - Visualization: clustering visualization based on hierarchical classification
53. SimHPC parallel simulator
- Challenge for HPC simulation: performance
  - Target systems have >1,000 nodes and processors
  - Difficult for traditional architecture simulators, e.g., Simics
- Our solution
  - Parallel simulation: use a cluster to simulate a cluster
  - Use the same node type in the host system as in the target
    - Advantage: no need to model and simulate detailed components, such as processor pipelines and caches
  - Execution-driven, full-system simulation; supports execution of Linux and applications, including benchmarks (e.g., Linpack)
54. SimHPC parallel simulator (cont.)
- Analysis
  - The execution time of a process in the target system is composed of:
    - Trun: execution time of instruction sequences
    - TIO: I/O blocking time, such as reading/writing files and sending/receiving messages
    - Tready: waiting time in the ready state
- So the simulator needs to:
  - Capture system events
    - process scheduling
    - I/O operations: read/write files, MPI send()/recv() (a PMPI-based capture sketch follows below)
  - Simulate the I/O and interconnection network subsystems
  - Synchronize the timing of each application process
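The slides do not say how SimHPC captures MPI events; one common mechanism is the MPI profiling interface (PMPI), sketched below under that assumption: a wrapper with the standard MPI-3 signature records the event, then calls the real implementation. Run with two ranks (e.g., mpirun -np 2).

```cuda
#include <cstdio>
#include <mpi.h>

// Wrapper intercepts every MPI_Send made by the application (MPI-3 signature).
int MPI_Send(const void *buf, int count, MPI_Datatype dt,
             int dest, int tag, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, dt, dest, tag, comm);  // real send
    double t1 = MPI_Wtime();
    // In SimHPC such a record would be forwarded to the central node for time
    // analysis and synchronization; here it is just printed.
    printf("event: MPI_Send dest=%d count=%d blocked=%.6fs\n",
           dest, count, t1 - t0);
    return rc;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int x = 42;
    if (rank == 0)
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);        // captured
    else if (rank == 1)
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
```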
55. SimHPC parallel simulator (cont.)
- System architecture
  - Application processes of multiple target nodes are allocated to one host node
    - number of host nodes << number of target nodes
  - Events are captured on the host nodes while the application is running
  - Events are sent to the central node for time analysis, synchronization, and simulation
56. SimHPC parallel simulator (cont.)
- Experiment results
  - Host: 5 IBM Blade HS21 (2-way Xeon)
  - Target: 32 to 1024 nodes
  - OS: Linux
  - Application: Linpack HPL
- [Figures: simulation error test; communication time for fat-tree and 2D-mesh interconnection networks; Linpack performance for fat-tree and 2D-mesh interconnection networks]
57. System-level power management
- Power-aware job scheduling algorithm (a policy sketch follows below)
  - Suspend a node if its idle time exceeds a threshold
  - Wake up nodes if there are not enough nodes to execute jobs, while avoiding node thrashing between the busy and suspended states
  - The algorithm is integrated into OpenPBS
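A sketch of the scheduling policy described above (suspend on long idle, wake on demand, avoid thrashing). The thresholds, the minimum-residency rule, and all names are assumptions; the actual OpenPBS integration is not shown here.

```cuda
#include <cstdio>
#include <vector>

enum State { BUSY, IDLE, SUSPENDED };

struct Node {
    State state = IDLE;
    int   idle_time = 0;       // seconds spent idle
    int   since_change = 0;    // seconds since last suspend/wakeup
};

// One scheduling tick: decide suspends/wakeups for the current demand.
void power_schedule(std::vector<Node> &nodes, int waiting_jobs,
                    int idle_threshold, int min_residency) {
    int available = 0;
    for (auto &n : nodes) {
        n.since_change++;
        if (n.state == IDLE) { n.idle_time++; available++; }
    }
    // Wake nodes if there are not enough nodes to run the waiting jobs.
    for (auto &n : nodes) {
        if (waiting_jobs <= available) break;
        if (n.state == SUSPENDED && n.since_change >= min_residency) {
            n.state = IDLE; n.idle_time = 0; n.since_change = 0;
            available++;
            printf("wakeup node\n");
        }
    }
    // Suspend nodes whose idle time exceeds the threshold; the residency check
    // keeps nodes from thrashing between busy and suspended states.
    for (auto &n : nodes) {
        if (n.state == IDLE && n.idle_time > idle_threshold &&
            available > waiting_jobs && n.since_change >= min_residency) {
            n.state = SUSPENDED; n.since_change = 0;
            available--;
            printf("suspend node\n");
        }
    }
}

int main() {
    std::vector<Node> cluster(8);
    for (int t = 0; t < 120; t++)             // two idle minutes: nodes get suspended
        power_schedule(cluster, /*waiting_jobs=*/0,
                       /*idle_threshold=*/60, /*min_residency=*/30);
    power_schedule(cluster, /*waiting_jobs=*/4, 60, 30);   // job burst: nodes wake up
    return 0;
}
```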
58. System-level power management (cont.)
- Power management tool
  - Monitors the power-related status of the system
  - Reduces the runtime power consumption of the machine
  - Multiple power management policies
    - Manual control
    - On-demand control
    - Suspend-enable
- [Figure: layers of power management]
59. System-level power management (cont.)
- Power management test
  - On 5 IBM HS21 blades

Task load (tasks/hour)   Policy      Task exec. time (s)   Power consumption (J)   Performance slowdown   Power saving
20                       On-demand   3.55                  1778077                 5.15                   -1.66
20                       Suspend     3.60                  1632521                 9.76                   -12.74
200                      On-demand   3.55                  1831432                 4.62                   -3.84
200                      Suspend     3.65                  1683161                 10.61                  -10.78
800                      On-demand   3.55                  2132947                 3.55                   -7.05
800                      Suspend     3.66                  2123577                 11.25                  -9.34

Power management test for different task loads (compared to no power management).
60. Domain-specific programming support
61. Parallel computing platform for astrophysics
- Joint work of:
  - Shanghai Astronomical Observatory, CAS (SHAO)
  - Institute of Software, CAS (ISCAS)
  - Shanghai Supercomputer Center (SSC)
- Builds a high-performance parallel computing software platform for astrophysics research, focusing on planetary fluid dynamics and N-body problems
- New parallel computing models and parallel algorithms are studied, validated, and adopted to achieve high performance
62. Architecture
63. PETSc optimized (speedup 15-26)
- Method 1: domain decomposition ordering method for field coupling
- Method 2: preconditioner for the domain decomposition method
- Method 3: PETSc multi-physics data structure
- [Figure: left mesh 128 x 128 x 96, right mesh 192 x 192 x 128; computation speedup 15-26; strong scalability of original code vs. new code (near ideal); test environment: BlueGene/L at NCAR (HPCA2009)]
64. Strong scalability on TianHe-1A
65. CLeXML math library
66. BLAS2 performance: MKL vs. CLeXML
67. HPC software support for earth system modeling
- Led by Tsinghua University
  - Tsinghua
  - Beihang University
  - Jiangnan Computing Institute
  - Peking University
- Part of the national effort on climate change study
68. Earth system model development workflow
69. Major research activities
70. (No transcript)
71. Expected results
72. Potential cooperation areas
- Software for exa-scale computer systems
  - Power
  - Performance
  - Programmability
  - Resilience
- CPU/GPU hybrid programming
- Parallel algorithms and parallel program frameworks
- Large-scale parallel application support
- Applications requiring ExaFlops computers
73. Thank you!