Title: Overview of Extreme-Scale Software Research in China
1. Overview of Extreme-Scale Software Research in China
- Depei Qian
- Sino-German Joint Software Institute (JSI)
- Beihang University
- China-USA Computer Software Workshop
- Sep. 27, 2011
2. Outline
- Related R&D efforts in China
- Algorithms and computational methods
- HPC and e-Infrastructure
- Parallel programming frameworks
- Programming heterogeneous systems
- Advanced compiler technology
- Tools
- Domain-specific programming support
3. Related R&D efforts in China
- NSFC
  - Basic algorithms and computable modeling for high performance scientific computing
  - Network-based research environment
  - Many-core parallel programming
- 863 program
  - High productivity computer and Grid service environment
  - Multicore/many-core programming support
  - HPC software for earth system modeling
- 973 program
  - Parallel algorithms for large-scale scientific computing
  - Virtual computing environment
4. Algorithms and Computational Methods
5. NSFC's Key Initiative on Algorithms and Modeling
- Basic algorithms and computable modeling for high performance scientific computing
- 8-year program, launched in 2011
- 180 million Yuan funding
- Focused on
  - Novel computational methods and basic parallel algorithms
  - Computable modeling for selected domains
  - Implementation and verification of parallel algorithms by simulation
6. (No transcript)
7. 863's key projects on HPC and Grid
- High Productivity Computer and Grid Service Environment
  - Period: 2006-2010
  - 940 million Yuan from MOST and more than 1 billion Yuan in matching funds from other sources
- Major R&D activities
  - Developing PFlops computers
  - Building up a grid service environment: CNGrid
  - Developing Grid and HPC applications in selected areas
8. CNGrid GOS Architecture
9. Abstractions
- Grid community: Agora
  - persistent information storage and organization
- Grid process: Grip
  - runtime control
10. CNGrid GOS deployment
- CNGrid GOS deployed on 11 sites and some application Grids
- Supports heterogeneous HPCs: Galaxy, Dawning, DeepComp
- Supports multiple platforms: Unix, Linux, Windows
- Uses public network connections, with only the HTTP port enabled
- Flexible clients
  - Web browser
  - Special client
  - GSML client
11. CNGrid sites
- Tsinghua University: 1.33 TFlops, 158 TB storage, 29 applications, 100 users, IPv4/v6 access
- CNIC: 150 TFlops, 1.4 PB storage, 30 applications, 269 users all over the country, IPv4/v6 access
- IAPCM: 1 TFlops, 4.9 TB storage, 10 applications, 138 users, IPv4/v6 access
- Shandong University: 10 TFlops, 18 TB storage, 7 applications, 60 users, IPv4/v6 access
- GSCC: 40 TFlops, 40 TB storage, 6 applications, 45 users, IPv4/v6 access
- SSC: 200 TFlops, 600 TB storage, 15 applications, 286 users, IPv4/v6 access
- XJTU: 4 TFlops, 25 TB storage, 14 applications, 120 users, IPv4/v6 access
- USTC: 1 TFlops, 15 TB storage, 18 applications, 60 users, IPv4/v6 access
- HUST: 1.7 TFlops, 15 TB storage, IPv4/v6 access
- SIAT: 10 TFlops, 17.6 TB storage, IPv4/v6 access
- HKU: 20 TFlops, 80 users, IPv4/v6 access
12. CNGrid resources
- 11 sites
- >450 TFlops
- 2900 TB storage
- Three PF-scale sites will be integrated into CNGrid soon
13. CNGrid: services and users
- 230 services
- >1400 users, including:
  - China Commercial Aircraft Corp.
  - Bao Steel
  - automobile industry
  - institutes of CAS
  - universities
14. CNGrid: applications
- Supporting >700 projects
- 973, 863, NSFC, CAS Innovative, and Engineering projects
15. Parallel programming frameworks
16. JASMIN: a parallel programming framework
- [Framework diagram: a library layer separates the models, stencils, and algorithms (special and common parts) from the underlying computers]
- Also supported by the 973 and 863 projects
17. Basic ideas
- Hide the complexity of programming millions of cores
- Integrate efficient implementations of parallel fast numerical algorithms
- Provide efficient data structures and solver libraries
- Support software engineering for code extensibility
18. Basic ideas (cont.)
- [Diagram: scale up from serial programming on a personal computer to a TeraFlops cluster and on to a PetaFlops MPP by using the JASMIN infrastructures]
19. JASMIN
- [Application areas: inertial confinement fusion, global climate modeling, CFD, and material simulations, on structured grids, unstructured grids, and particle simulations]
20. JASMIN (cont.)
- User provides: physics, parameters, numerical methods, expert experience, special algorithms, etc.
- User interfaces: component-based parallel programming models (C++ classes)
- Numerical algorithms: geometry, fast solvers, mature numerical methods, time integrators, etc.
- HPC implementations (thousands of CPUs): data structures, parallelization, load balancing, adaptivity, visualization, restart, memory, etc.
- Architecture: multilayered, modularized, object-oriented
- Codes: C++/C/F90/F77 with MPI/OpenMP, about 500,000 lines
- Installation: personal computers, clusters, MPP
21. Numerical simulations on TianHe-1A

Code      CPU cores     Code                CPU cores
LARED-S   32,768        RH2D                1,024
LARED-P   72,000        HIME3D              3,600
LAP3D     16,384        PDD3D               4,096
MEPH3D    38,400        LARED-R             512
MD3D      80,000        LARED Integration   128
RT3D      1,000

- Simulation duration: several hours to tens of hours.
22. Programming heterogeneous systems
23. GPU programming support
- Source-to-source translation
- Runtime optimization
- Mixed programming model for multi-GPU systems
24. S2S translation for GPU
- A source-to-source translator, GPU-S2S, for GPU programming
- Facilitates the development of parallel programs on GPUs by combining automatic mapping and static compilation
25. S2S translation for GPU (cont.)
- Insert directives into the source program
  - Guide implicit calls of the CUDA runtime libraries
  - Enable the user to control the mapping from the homogeneous CPU platform to the GPU's streaming platform
- Optimization based on runtime profiling
  - Take full advantage of the GPU according to application characteristics by collecting runtime dynamic information (a hypothetical directive-to-CUDA sketch follows below)
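The slides do not show the actual GPU-S2S directive syntax, so the directive below is a hypothetical placeholder; the sketch only illustrates the general idea of directive-guided source-to-source translation, i.e., what kind of CUDA code a translator could emit for an annotated loop.

```cuda
// Hypothetical input the user might write (directive name is an assumption):
//
//     #pragma gpus2s kernel map(to: a[0:n], b[0:n]) map(from: c[0:n])
//     for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
//
// A translator could turn it into CUDA code of roughly this shape:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_add_kernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a = new float[n], *b = new float[n], *c = new float[n];
    for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    // "map(to:...)" becomes explicit allocation plus host-to-device copies.
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    // The annotated loop becomes a kernel launch.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vec_add_kernel<<<blocks, threads>>>(da, db, dc, n);

    // "map(from:...)" becomes a device-to-host copy after the kernel.
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[12345] = %f\n", c[12345]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] a; delete[] b; delete[] c;
    return 0;
}
```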
26. The GPU-S2S architecture
27. Program translation by GPU-S2S
28. Runtime optimization based on profiling
- First-level profiling (function level)
- Second-level profiling (memory access and kernel improvement)
- Third-level profiling (data partition)
29. First-level profiling
- Identify computing kernels
- Instrument and scan the source code, get the execution time of every function, and identify the computing kernels (a minimal timing sketch follows below)
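A minimal sketch of function-level instrumentation, assuming nothing about the actual GPU-S2S tooling: each candidate function is wrapped with a host timer so the dominant function can be picked as the computing kernel.

```cuda
#include <cstdio>
#include <chrono>

// Run a function once and report its wall-clock time.
static double timed(const char *name, void (*fn)(void)) {
    auto t0 = std::chrono::high_resolution_clock::now();
    fn();                                              // candidate function
    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("%-12s %.3f ms\n", name, ms);               // per-function profile record
    return ms;
}

static void setup()   { volatile double s = 0; for (int i = 0; i < 1000;     i++) s += i; }
static void compute() { volatile double s = 0; for (int i = 0; i < 50000000; i++) s += i; }

int main() {
    double t_setup   = timed("setup",   setup);
    double t_compute = timed("compute", compute);
    // The function dominating total time is selected as the computing kernel.
    printf("computing kernel: %s\n", t_compute > t_setup ? "compute" : "setup");
    return 0;
}
```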
30. Second-level profiling
- Identify the memory access pattern and improve the kernels
- Instrument the computing kernels
- Extract and analyze the profile information, optimize according to the characteristics of the application, and finally generate CUDA code with optimized kernels
31. Third-level profiling
- Optimize by improving the data partitioning
- Get copy time and computing time by instrumentation
- Compute the number of streams and the data size of each stream
- Generate the optimized CUDA code with streams (a copy/compute overlap sketch follows below)
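A sketch of the kind of stream-partitioned code this level could generate. The chunk count here is fixed rather than derived from measured copy/compute times as GPU-S2S would do; that choice is an assumption for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int n = 1 << 22, NSTREAMS = 4, chunk = n / NSTREAMS;
    float *h, *d;
    cudaMallocHost((void **)&h, n * sizeof(float));   // pinned memory enables async copies
    cudaMalloc((void **)&d, n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    cudaStream_t s[NSTREAMS];
    for (int k = 0; k < NSTREAMS; k++) cudaStreamCreate(&s[k]);

    // Each chunk gets its own stream so copy-in, kernel, and copy-out overlap
    // across chunks instead of running as one serial copy/compute/copy chain.
    for (int k = 0; k < NSTREAMS; k++) {
        int off = k * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);             // expect 2.0

    for (int k = 0; k < NSTREAMS; k++) cudaStreamDestroy(s[k]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```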
32. Matrix multiplication: performance comparison before and after profiling
- The CUDA code with three-level profiling optimization achieves a 31% improvement over the CUDA code with only memory access optimization, and a 91% improvement over the CUDA code using only global memory for computing.
- [Figure: execution performance comparison on different platforms]
33. FFT (1,048,576 points): performance comparison before and after profiling
- The CUDA code with three-level profiling optimization achieves a 38% improvement over the CUDA code with memory access optimization, and a 77% improvement over the CUDA code using only global memory for computing.
- [Figure: FFT (1,048,576 points) execution performance comparison on different platforms]
34. Programming multi-GPU systems
- The memory of a CPU+GPU system is both distributed and shared, so it is feasible to use MPI and PGAS programming models for this new kind of system.
- MPI or PGAS: use message passing or shared data for communication between parallel tasks or GPUs
35. Mixed programming model
- [Diagram: the NVIDIA GPU (CUDA) model combined with traditional programming models (MPI/UPC) gives MPI+CUDA and UPC+CUDA; CUDA program execution flow]
36. MPI+CUDA experiment
- Platform
  - 2 NF5588 servers, each equipped with:
    - 1 Xeon CPU (2.27 GHz), 12 GB main memory
    - 2 NVIDIA Tesla C1060 GPUs (GT200 architecture, 4 GB device memory)
  - 1 Gbit Ethernet
  - Red Hat Linux 5.3
  - CUDA Toolkit 2.3 and CUDA SDK
  - OpenMPI 1.3
  - Berkeley UPC 2.1
37. MPI+CUDA experiment (cont.)
- Matrix multiplication program
  - Uses block matrix multiplication for the UPC programming
  - Data are spread across the UPC threads
  - The computing kernel multiplies two blocks at a time and is implemented in CUDA
- Total execution time: Tsum = Tcom + Tcuda = Tcom + Tcopy + Tkernel (a measurement sketch follows below)
  - Tcom: UPC thread communication time
  - Tcuda: CUDA program execution time
  - Tcopy: data transfer time between host and device
  - Tkernel: GPU computing time
38. MPI+CUDA experiment (cont.)
- Setup: 2 servers, at most 8 MPI tasks; 1 server with 2 GPUs
- For 4094x4096 matrices, the speedup of 1 MPI+CUDA task (using 1 GPU for computing) is 184x that of the case with 8 MPI tasks
- For small data scales, such as 256 or 512, the execution time using 2 GPUs is even longer than using 1 GPU
  - The computing scale is too small; the communication between the two tasks overwhelms the reduction in computing time
39. PKU Manycore Software Research Group
- Software tool development for GPU clusters
  - Unified multicore/manycore/clustering programming
  - Resilience technology for very large GPU clusters
- Software porting service
  - Joint project, <3k lines of code, supporting Tianhe
- Advanced training program
40. PKU-Tianhe turbulence simulation
- PKUFFT (using GPUs)
  - Reaches a scale 43 times that of the Earth Simulator run
  - 7168 nodes / 14336 CPUs / 7168 GPUs
  - FFT speed 1.6x that of Jaguar
  - Proof of feasibility of GPU speedup for large-scale systems
- [Chart compares PKUFFT (using GPUs), MKL (not using GPUs), and Jaguar]
41. Advanced Compiler Technology
42. Advanced Compiler Technology (ACT) group at ICT, CAS
- ACT's current research
  - Parallel programming languages and models
  - Optimizing compilers and tools for HPC (Dawning) and multicore processors (Loongson)
- Will lead the new multicore/many-core programming support project
43. PTA: Process-based TAsk parallel programming model
- New process-based task construct
  - With properties of isolation, atomicity, and deterministic submission
  - Annotates a loop into two parts: prologue and task segment
    - #pragma pta parallel [clauses]
    - #pragma pta task
    - #pragma pta propagate(varlist)
  - Suitable for expressing coarse-grained, irregular parallelism in loops (an annotation sketch follows below)
- Implementation and performance
  - PTA compiler, runtime system, and an assistant tool (helps write correct programs)
  - Speedups of 4.62 to 43.98 (average 27.58) on 48 cores and 3.08 to 7.83 (average 6.72) on 8 cores
  - Code changes are within 10 lines, much smaller than with OpenMP
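An illustrative sketch only: the slides list the PTA pragmas but not their full clause semantics, so the placement below (what counts as prologue vs. task segment, and the propagate list) is an assumption. Unknown pragmas are ignored by ordinary compilers, so the file still builds and runs serially without the PTA compiler.

```cuda
#include <cstdio>

int main() {
    const int n = 16;
    double sum = 0.0;

    #pragma pta parallel            /* loop whose iterations become tasks */
    for (int i = 0; i < n; i++) {
        int work = i * i;           /* prologue: cheap, runs before the task */

        #pragma pta task            /* coarse-grained, isolated task segment */
        {
            double local = 0.0;
            for (int j = 0; j < 100000; j++) local += work * 1e-6;

            #pragma pta propagate(sum)   /* deterministic submission of results */
            sum += local;
        }
    }
    printf("sum = %f\n", sum);
    return 0;
}
```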
44. UPC-H: a parallel programming model for deep parallel hierarchies
- Hierarchical UPC
  - Provides multi-level data distribution
  - Implicit and explicit hierarchical loop parallelism
  - Hybrid execution model: SPMD with fork-join
  - Multi-dimensional data distribution and super-pipelining
- Implementations on CUDA clusters and the Dawning 6000 cluster
  - Based on Berkeley UPC
  - Enhanced optimizations such as localization and communication optimization
  - Supports SIMD intrinsics
  - CUDA cluster: 72% of the performance of hand-tuned versions, with code size reduced to 68%
  - Multi-core cluster: better process mapping and cache reuse than UPC
45. OpenMP and runtime support for heterogeneous platforms
- Heterogeneous platforms consisting of CPUs and GPUs
  - Multiple GPUs, or CPU-GPU cooperation, bring extra data transfers that hurt the performance gain
  - Programmers need a unified data management system
- OpenMP extension
  - Specify a partitioning ratio to optimize data transfers globally (a plain CPU/GPU partitioning sketch follows below)
  - Specify heterogeneous blocking sizes to reduce false sharing among computing devices
- Runtime support
  - DSM system based on the specified blocking size
  - Intelligent runtime prefetching with the help of compiler analysis
- Implementation and results
  - On the OpenUH compiler
  - Gains a 1.6x speedup through prefetching on NPB/SP (class C)
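This is not the project's actual OpenMP extension, whose directive syntax the slides do not show; it is only a plain sketch of the idea behind a partitioning ratio: a hypothetical ratio decides how much of an array the GPU processes while the CPU handles the rest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void inc_gpu(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    const float ratio = 0.75f;              // assumed ratio: 75% of the work on the GPU
    const int n_gpu = (int)(n * ratio);
    float *h = new float[n];
    for (int i = 0; i < n; i++) h[i] = 0.0f;

    // GPU part of the partition.
    float *d; cudaMalloc((void **)&d, n_gpu * sizeof(float));
    cudaMemcpy(d, h, n_gpu * sizeof(float), cudaMemcpyHostToDevice);
    inc_gpu<<<(n_gpu + 255) / 256, 256>>>(d, n_gpu);

    // CPU part of the partition, overlapped with the asynchronous GPU kernel.
    #pragma omp parallel for
    for (int i = n_gpu; i < n; i++) h[i] += 1.0f;

    cudaMemcpy(h, d, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0]=%f h[n-1]=%f\n", h[0], h[n - 1]);   // both partitions updated
    cudaFree(d); delete[] h;
    return 0;
}
```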
46. Analyzers based on compiling techniques for MPI programs
- Communication slicing and process mapping tool
  - Compiler part
    - PDG graph building and slice generation
    - Iteration set transformation for approximation
  - Optimized mapping tool
    - Weighted graph, hardware characteristics
    - Graph partitioning and feedback-based evaluation
- Memory bandwidth measuring tool for MPI programs
  - Detects bursts of bandwidth requirements
- Enhanced performance of MPI error checking
  - Redundant error checking removed by dynamically turning the global error checking on/off
  - With the help of compiler analysis on communicators
  - Integrated with a model checking tool (ISP) and a runtime checking tool (MARMOT)
47. LoongCC: an optimizing compiler for Loongson multicore processors
- Based on Open64-4.2, supporting C/C++/Fortran
  - Open source at http://svn.open64.net/svnroot/open64/trunk/
- Powerful optimizer and analyzer with better performance
  - SIMD intrinsic support
  - Memory locality optimization
  - Data layout optimization
  - Data prefetching
  - Load/store grouping for 128-bit memory access instructions
- Integrated with an Aggressive Auto-Parallelization Optimization (AAPO) module
  - Dynamic privatization
  - Parallel model with dynamic alias optimization
  - Array reduction optimization
48. Tools
49. Testing and evaluation of HPC systems
- A center led by Tsinghua University (Prof. Wenguang Chen)
- Developing accurate and efficient testing and evaluation tools
- Developing benchmarks for HPC evaluation
- Providing services to HPC developers and users
50. LSP3AS: large-scale parallel program performance analysis system
- Designed for performance tuning on peta-scale HPC systems
- Method
  - Source code is instrumented
  - The instrumented code is executed, generating profiling/tracing data files
  - The profiling/tracing data are analyzed and a visualization report is generated
- Instrumentation is based on TAU from the University of Oregon
51. LSP3AS: large-scale parallel program performance analysis system (cont.)
- Scalable performance data collection
  - Distributed data collection and transmission: eliminates bottlenecks in the network and in data processing
  - Dynamic compensation: reduces the influence of the performance data volume
  - Efficient data transmission: uses Remote Direct Memory Access (RDMA) to achieve high bandwidth and low latency
52. LSP3AS: large-scale parallel program performance analysis system (cont.)
- Analysis and visualization
  - Data analysis: iteration-based clustering is used
  - Visualization: clustering visualization based on hierarchical classification
53. SimHPC parallel simulator
- Challenge for HPC simulation: performance
  - Target systems have >1,000 nodes and processors
  - Difficult for traditional architecture simulators, e.g., Simics
- Our solution
  - Parallel simulation: use a cluster to simulate a cluster
  - Use the same node type in the host system as in the target
    - Advantage: no need to model and simulate detailed components, such as processor pipelines and caches
  - Execution-driven, full-system simulation; supports execution of Linux and applications, including benchmarks (e.g., Linpack)
54. SimHPC parallel simulator (cont.)
- Analysis
  - The execution time of a process in the target system is composed of:
    - Trun: execution time of instruction sequences
    - TIO: I/O blocking time, such as reading/writing files and sending/receiving messages
    - Tready: waiting time in the ready state
- So the simulator needs to:
  - Capture system events
    - process scheduling
    - I/O operations: read/write files, MPI send()/recv() (a PMPI-based capture sketch follows below)
  - Simulate the I/O and interconnection network subsystems
  - Synchronize the timing of each application process
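The slides do not say how SimHPC captures MPI events; one common mechanism is the MPI profiling interface (PMPI), sketched below under that assumption: a wrapper with the standard MPI-3 signature records the event, then calls the real implementation. Run with two ranks (e.g., mpirun -np 2).

```cuda
#include <cstdio>
#include <mpi.h>

// Wrapper intercepts every MPI_Send made by the application (MPI-3 signature).
int MPI_Send(const void *buf, int count, MPI_Datatype dt,
             int dest, int tag, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, dt, dest, tag, comm);  // real send
    double t1 = MPI_Wtime();
    // In SimHPC such a record would be forwarded to the central node for time
    // analysis and synchronization; here it is just printed.
    printf("event: MPI_Send dest=%d count=%d blocked=%.6fs\n",
           dest, count, t1 - t0);
    return rc;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int x = 42;
    if (rank == 0)
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);        // captured
    else if (rank == 1)
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
```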
55. SimHPC parallel simulator (cont.)
- System architecture
  - Application processes of multiple target nodes are allocated to one host node
    - number of host nodes << number of target nodes
  - Events are captured on the host nodes while the application is running
  - Events are sent to the central node for time analysis, synchronization, and simulation
56. SimHPC parallel simulator (cont.)
- Experiment results
  - Host: 5 IBM Blade HS21 (2-way Xeon)
  - Target: 32 to 1024 nodes
  - OS: Linux
  - Application: Linpack HPL
- [Figures: simulation error test; communication time for fat-tree and 2D-mesh interconnection networks; Linpack performance for fat-tree and 2D-mesh interconnection networks]
57. System-level power management
- Power-aware job scheduling algorithm (a policy sketch follows below)
  - Suspend a node if its idle time exceeds a threshold
  - Wake up nodes if there are not enough nodes to execute jobs, while avoiding node thrashing between the busy and suspended states
  - The algorithm is integrated into OpenPBS
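A sketch of the scheduling policy described above (suspend on long idle, wake on demand, avoid thrashing). The thresholds, the minimum-residency rule, and all names are assumptions; the actual OpenPBS integration is not shown here.

```cuda
#include <cstdio>
#include <vector>

enum State { BUSY, IDLE, SUSPENDED };

struct Node {
    State state = IDLE;
    int   idle_time = 0;       // seconds spent idle
    int   since_change = 0;    // seconds since last suspend/wakeup
};

// One scheduling tick: decide suspends/wakeups for the current demand.
void power_schedule(std::vector<Node> &nodes, int waiting_jobs,
                    int idle_threshold, int min_residency) {
    int available = 0;
    for (auto &n : nodes) {
        n.since_change++;
        if (n.state == IDLE) { n.idle_time++; available++; }
    }
    // Wake nodes if there are not enough nodes to run the waiting jobs.
    for (auto &n : nodes) {
        if (waiting_jobs <= available) break;
        if (n.state == SUSPENDED && n.since_change >= min_residency) {
            n.state = IDLE; n.idle_time = 0; n.since_change = 0;
            available++;
            printf("wakeup node\n");
        }
    }
    // Suspend nodes whose idle time exceeds the threshold; the residency check
    // keeps nodes from thrashing between busy and suspended states.
    for (auto &n : nodes) {
        if (n.state == IDLE && n.idle_time > idle_threshold &&
            available > waiting_jobs && n.since_change >= min_residency) {
            n.state = SUSPENDED; n.since_change = 0;
            available--;
            printf("suspend node\n");
        }
    }
}

int main() {
    std::vector<Node> cluster(8);
    for (int t = 0; t < 120; t++)             // two idle minutes: nodes get suspended
        power_schedule(cluster, /*waiting_jobs=*/0,
                       /*idle_threshold=*/60, /*min_residency=*/30);
    power_schedule(cluster, /*waiting_jobs=*/4, 60, 30);   // job burst: nodes wake up
    return 0;
}
```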
58. System-level power management (cont.)
- Power management tool
  - Monitors the power-related status of the system
  - Reduces the runtime power consumption of the machine
  - Multiple power management policies
    - Manual control
    - On-demand control
    - Suspend-enable
- [Figure: layers of power management]
59. System-level power management (cont.)
- Power management test
  - On 5 IBM HS21 blades

Task load (tasks/hour)   Policy      Task exec. time (s)   Power consumption (J)   Performance slowdown   Power saving
20                       On-demand   3.55                  1778077                 5.15                   -1.66
20                       Suspend     3.60                  1632521                 9.76                   -12.74
200                      On-demand   3.55                  1831432                 4.62                   -3.84
200                      Suspend     3.65                  1683161                 10.61                  -10.78
800                      On-demand   3.55                  2132947                 3.55                   -7.05
800                      Suspend     3.66                  2123577                 11.25                  -9.34

Power management test for different task loads (compared to no power management).
60. Domain-specific programming support
61. Parallel computing platform for astrophysics
- Joint work of:
  - Shanghai Astronomical Observatory, CAS (SHAO)
  - Institute of Software, CAS (ISCAS)
  - Shanghai Supercomputer Center (SSC)
- Builds a high-performance parallel computing software platform for astrophysics research, focusing on planetary fluid dynamics and N-body problems
- New parallel computing models and parallel algorithms are studied, validated, and adopted to achieve high performance
62. Architecture
63. PETSc optimized (speedup 15-26)
- Method 1: domain decomposition ordering method for field coupling
- Method 2: preconditioner for the domain decomposition method
- Method 3: PETSc multi-physics data structure
- [Figure: left mesh 128 x 128 x 96, right mesh 192 x 192 x 128; computation speedup 15-26; strong scalability of original code vs. new code (near ideal); test environment: BlueGene/L at NCAR (HPCA2009)]
64. Strong scalability on TianHe-1A
65. CLeXML math library
66. BLAS2 performance: MKL vs. CLeXML
67. HPC software support for earth system modeling
- Led by Tsinghua University
  - Tsinghua
  - Beihang University
  - Jiangnan Computing Institute
  - Peking University
- Part of the national effort on climate change study
68. Earth system model development workflow
69. Major research activities
70. (No transcript)
71. Expected results
72. Potential cooperation areas
- Software for exa-scale computer systems
  - Power
  - Performance
  - Programmability
  - Resilience
- CPU/GPU hybrid programming
- Parallel algorithms and parallel program frameworks
- Large-scale parallel application support
- Applications requiring ExaFlops computers
73. Thank you!