Title: Parallel Programming Environment for SMP Clusters
1. Parallel Programming Environment for SMP Clusters
- Seoul National University
- Co-Design and Parallel Processing Lab.
- Yang-Suk Kee
2. Contents
- High-Performance Computer Architectures
- Programming Methodologies
- ParADE Architecture
- Design Issues & Approaches
- Experiments
- Conclusions
- Discussions
3. High-Performance Computer Architectures
4. Top 500 Supercomputers
[Figure: number of systems in the Top 500 list by architecture class (SMPs, Constellations, Clusters, MPPs), 1997-2002; source: www.top500.org]
5. Convergence of High-Performance Computer Architectures
[Figure: four multiprocessor nodes, each with four processors sharing a memory, connected by a high-performance interconnection network]
- A multicomputer with multiprocessor nodes!
6. SMP Cluster
- COTS (commodity off-the-shelf) components
  - Microprocessors
  - High-performance networks
- How to exploit them?
  - For high throughput or high availability
  - For high-performance computing
7. Programming Methodologies
8. Programming Methodologies
- Architecture
  - A hybrid of message passing and shared address space architectures
- Programming methodologies
  - Pure message passing model
  - Pure shared address space model
  - Hybrid of message passing and shared address space programming models
9. Pure Message Passing Model
[Figure: two multiprocessor nodes, each running four MPI processes; all communication goes through the NICs over the high-performance network]
10. Pure Shared Address Space Model
[Figure: two multiprocessor nodes, each running four SVM threads over hardware shared memory (H/W SM); a shared virtual memory layer spans both nodes]
11. Hybrid of Message Passing and Shared Address Space Models
[Figure: one MPI process per node, each containing four OpenMP threads over hardware shared memory (H/W SM); the MPI processes communicate through the NICs over the high-performance network]
12. Comparison Between Parallel Programming Methodologies

|                            | Pure MP   | Pure SAS | Hybrid    |
|----------------------------|-----------|----------|-----------|
| Programming cost           | Expensive | Cheap    | Expensive |
| Performance on SMP cluster | Good      | Poor     | Good      |
| Portability                | Good      | Poor     | Good      |
| Target architecture range  | Broad     | Broad    | Broad     |

- Easy and high-performance programming?
13. ParADE Architecture
14. ParADE (Parallel Application Developing Environment)
- Motivation
  - Easy and high-performance parallel programming on SMP cluster systems
- Approaches
  - Easy programming: standardized shared address space programming
  - High performance: hybrid execution of message passing and shared address space
15. Why OpenMP?
- Easy programming
  - Shared address space programming model
- Portability
  - Standardized programming model
- Incremental parallelization
  - Directive-based parallelism (see the sketch below)
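To make "incremental parallelization" concrete, here is a generic two-line illustration (not taken from ParADE): the loop is valid serial C, and a single directive parallelizes it without restructuring the program.

    #include <stdio.h>

    int main(void) {
        double a[1000];

        /* One directive parallelizes the loop; a compiler without
         * OpenMP support simply ignores the pragma and the program
         * still runs correctly in serial. */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            a[i] = i * 0.5;

        printf("a[999] = %f\n", a[999]);
        return 0;
    }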
16. Why Hybrid Execution?
- Map OpenMP directives to message passing primitives?
  - Implementation barrier
- Map OpenMP directives to SVM primitives?
  - Poor performance
- Hybrid execution is intuitive and reasonable.
- How?
17. OpenMP on SVM
- OpenMP on TreadMarks
  - By Rice University
  - OpenMP translator
  - Multi-threaded SVM
- Omni/SCASH
  - By RWCP
  - Omni OpenMP compiler
  - Compiler-assisted DSM
- OpenMP on SVM
  - By Purdue University
  - Multi-threaded SVM
  - Under construction
18. How?
- ParADE OpenMP translator
  - Analyzes an OpenMP program and generates a hybrid C program of message passing and POSIX threads
- ParADE runtime system
  - The converted program is compiled and linked with the ParADE runtime library
19. ParADE Architecture
[Figure: ParADE software stack: OpenMP translator on top of the runtime system on top of the kernel]
20. Design Issues & Approaches
21. Issues in ParADE
- Beyond OpenMP on SVM
- Translator
  - Optimizes synchronization and work-sharing directives
- Runtime system
  - Solves the atomic page update problem
  - Exploits data locality
  - Simplifies the memory consistency protocol
22. First Focus: Atomic Page Update
- Conventional page-based SVM
  - Based on the virtual memory protection mechanism
  - Uses the SIGSEGV and SIGIO signals
  - Multiple threads in a data race
  - Long page-fetch latency
23. Atomic Page Update Problem
[Figure: thread T1 of Process 1 reads A and faults; the SIGSEGV handler requests the page from Process 2, whose SIGIO handler returns it. T1's handler maps A writable with mmap(A, PROT_WRITE), copies the page in, and finally calls mprotect(A, PROT_READ). While A is writable but not yet filled, a Read(A) by thread T2 returns garbage.]
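To make the unsafe window concrete, below is a minimal, compilable sketch of the faulting side of a page-based SVM (an illustration of the general mechanism, not ParADE's actual handler; fetch_remote_page() is a hypothetical stand-in for the network receive). Everything between the first mprotect() and the end of the copy is the window in which a sibling thread can read garbage.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *page;        /* an SVM page, kept PROT_NONE until first touch */
    static long  page_size;

    /* Hypothetical stand-in for receiving the page from its home node. */
    static void fetch_remote_page(char *dst, size_t len) {
        memset(dst, 0xAB, len);
    }

    static void segv_handler(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        char *fault = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1));

        /* Open the page so the handler can install the incoming data.
         * (Real SVMs likewise call non-async-signal-safe functions here.) */
        mprotect(fault, page_size, PROT_READ | PROT_WRITE);

        /* UNSAFE WINDOW: the page no longer faults, but its contents are
         * not installed yet; a sibling thread reading it now sees garbage. */
        fetch_remote_page(fault, page_size);

        mprotect(fault, page_size, PROT_READ);
    }

    int main(void) {
        page_size = sysconf(_SC_PAGESIZE);
        page = mmap(NULL, page_size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        return page[0] == (char)0xAB ? 0 : 1;  /* first read triggers the handler */
    }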
24. Conventional Solution: File Mapping

Application view (protected address space):

    A = mmap(0, Size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FILE, fd, 0);
    mprotect(A, Size, PROT_NONE);

System view (freely accessible address space):

    S = mmap(0, Size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FILE, fd, 0);

Backing file:

    fd = open(FileName, O_RDWR|O_CREAT, S_IRWXU);
    write(fd, zero, Size);
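The slide's fragment, expanded into a self-contained program (the file name is illustrative, and ftruncate() stands in for the slide's write(fd, zero, Size) zero-fill):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        size_t size = (size_t)sysconf(_SC_PAGESIZE);

        /* Backing file, zero-filled to `size` bytes. */
        int fd = open("/tmp/parade_shm", O_RDWR | O_CREAT, S_IRWXU);
        ftruncate(fd, (off_t)size);

        /* Two views of the same file: A for the application, S for the system. */
        char *A = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *S = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        mprotect(A, size, PROT_NONE);   /* fence off the application view */

        /* The runtime updates the page through S while A stays protected,
         * and re-opens A only once the page is complete, so application
         * threads never observe a half-written page. */
        strcpy(S, "page updated through the system view");
        mprotect(A, size, PROT_READ);
        puts(A);                        /* prints the updated contents */

        unlink("/tmp/parade_shm");
        return 0;
    }

The expensive part is the initialization (creating and zero-filling a file on disk), which is what the Pros & Cons slide below charges against this approach.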
25. Multiple Paths to a Physical Page
[Figure: the same physical page is reached through two virtual addresses: virtual address 1 in the application address space for general memory access (protection READ or NONE), and virtual address 2 in the system address space for system memory updates (protection WRITE); the OS kernel resolves both to the same physical address]
26. Solution 1: System V Shared Memory

Application view (protected address space):

    A = shmat(ID, 0, 0);
    mprotect(A, Size, PROT_NONE);

System view (freely accessible address space):

    S = shmat(ID, 0, 0);

Shared segment:

    ID = shmget(IPC_PRIVATE, Size, IPC_CREAT|IPC_EXCL|SHM_R|SHM_W);
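The same dual-view pattern with System V shared memory, as a compilable sketch (error handling omitted). Note the explicit IPC_RMID cleanup: System V segments persist until removed, even after the creating process exits.

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/mman.h>
    #include <sys/shm.h>
    #include <unistd.h>

    int main(void) {
        size_t size = (size_t)sysconf(_SC_PAGESIZE);
        int id = shmget(IPC_PRIVATE, size,
                        IPC_CREAT | IPC_EXCL | SHM_R | SHM_W);

        char *A = shmat(id, NULL, 0);   /* application (protected) view */
        char *S = shmat(id, NULL, 0);   /* system (freely accessible) view */
        mprotect(A, size, PROT_NONE);

        strcpy(S, "written through the system view");
        mprotect(A, size, PROT_READ);
        puts(A);

        shmdt(A);
        shmdt(S);
        shmctl(id, IPC_RMID, NULL);     /* segments persist unless removed */
        return 0;
    }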
27. Solution 2: New mdup() System Call

    A = mmap(0, Size, PROT_NONE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);
    S = mdup(A, Size);
    mprotect(S, Size, PROT_READ|PROT_WRITE);

[Figure: A and S are two virtual address ranges backed by the same physical pages X, Y, and Z]
28. Solution 3: Child Process Creation

    A = mmap(0, Size, PROT_NONE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);
    fork();
    /* in the child: */
    mprotect(A, Size, PROT_READ|PROT_WRITE);

[Figure: after fork(), the code segment and copy-on-write areas are duplicated, but the shared memory area A maps to the same physical pages X, Y, and Z in both processes]
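A compilable sketch of the fork() approach (simplified: the child performs a single update and exits, whereas in a real SVM runtime both processes are long-lived). Because each process has its own page protections over the same physical pages, the parent's view can stay fenced while the child writes.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        size_t size = (size_t)sysconf(_SC_PAGESIZE);

        /* Shared anonymous mapping created before fork(): both processes
         * will address the same physical pages. */
        char *A = mmap(NULL, size, PROT_NONE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        if (fork() == 0) {
            /* Child ("system" side): its own protections, same pages. */
            mprotect(A, size, PROT_READ | PROT_WRITE);
            strcpy(A, "updated in the child");
            _exit(0);
        }
        wait(NULL);

        /* Parent (application side): open the page only after the update,
         * so its threads can never observe a half-written page. */
        mprotect(A, size, PROT_READ);
        puts(A);
        return 0;
    }

This is also why the table below lists no address-space shrinkage for this variant: the application keeps a single mapping, and the second view lives in the child's address space.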
29. Pros & Cons

|                         | File mapping           | System V shared memory | mdup()         | Child process creation       |
|-------------------------|------------------------|------------------------|----------------|------------------------------|
| Initialization cost     | Expensive              | Cheap                  | Cheap          | Cheap                        |
| Portability             | Good                   | Restricted             | Bad            | Good                         |
| Address space shrinkage | Yes                    | Yes                    | Yes            | No                           |
| Miscellaneous           | Unnecessary disk write | IPC overhead           | Memory leakage | Restricted use of mprotect() |
30. Performance: NAS Kernels (Class A)
31. Second Focus: Low-Cost Synchronization Directives
- Simplify the synchronization mechanism
- Exploit message passing for synchronization and work-sharing directives
- Work-sharing directive
  - #pragma omp single
- Synchronization directives
  - #pragma omp critical
  - #pragma omp atomic
  - #pragma omp ... reduction()
32. Critical Sections: CRITICAL, ATOMIC, REDUCTION

Source:

    #pragma omp atomic
    sum += i;

Conventional DSM translation:

    Acquire(Lock1);
    sum += i;
    Release(Lock1);

ParADE translation (see the runnable sketch below):

    pthread_mutex_lock(&lock);
    sum += i;
    if (the last thread)
        MPI_Allreduce(&sum, ..., MPI_SUM, ...);
    pthread_mutex_unlock(&lock);
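A hedged, runnable reconstruction of the scheme above rather than ParADE's actual generated code (the thread count, variable names, and the arrival counter are illustrative; assumes an MPI library providing MPI_THREAD_SERIALIZED):

    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long sum = 0;          /* per-node partial sum */
    static int  arrived = 0;      /* how many threads have done their update */

    static void *worker(void *arg) {
        long i = (long)arg;

        pthread_mutex_lock(&lock);
        sum += i;                             /* intra-node part of the atomic add */
        if (++arrived == NTHREADS) {          /* last thread on this node */
            long global = 0;
            MPI_Allreduce(&sum, &global, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
            sum = global;                     /* every node now holds the global sum */
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided, rank;
        /* Only one thread at a time calls MPI (under the mutex), so
         * SERIALIZED support is enough. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);

        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("global sum = %ld\n", sum);

        MPI_Finalize();
        return 0;
    }

The point of the hybrid translation is visible in the structure: no SVM page or lock traffic is generated at all; one mutex handles the node, one collective handles the cluster.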
33. Critical Sections: SINGLE

Source:

    #pragma omp single
    Init = 0;

Conventional DSM translation:

    // start of single construct
    Acquire(Lock1);
    if (the first thread)
        Init = 0;
    Release(Lock1);
    Barrier();
    // end of single construct

ParADE translation (see the runnable sketch below):

    // start of single construct
    pthread_mutex_lock(&lock);
    if (the first thread) {
        if (the master node)
            Init = 0;
        MPI_Bcast(&Init, ...);
    }
    pthread_mutex_unlock(&lock);
    parade_barrier();
    // end of single construct
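Again a hedged, self-contained reconstruction rather than ParADE's generated code: the first thread on each node joins the broadcast, only the master node's first thread executes the body, and a two-level (pthread + MPI) barrier stands in for parade_barrier():

    #define _POSIX_C_SOURCE 200112L
    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static pthread_mutex_t   lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t node_barrier;
    static int entered = 0;   /* has a thread on this node taken the body yet? */
    static int Init = -1;     /* the variable assigned inside the SINGLE body */
    static int rank;

    /* Stand-in for parade_barrier(): threads meet at a node-local barrier,
     * then exactly one of them synchronizes across nodes. */
    static void hybrid_barrier(void) {
        if (pthread_barrier_wait(&node_barrier) == PTHREAD_BARRIER_SERIAL_THREAD)
            MPI_Barrier(MPI_COMM_WORLD);
        pthread_barrier_wait(&node_barrier);  /* hold threads until all nodes arrive */
    }

    static void *worker(void *unused) {
        (void)unused;
        /* --- translation of: #pragma omp single { Init = 0; } --- */
        pthread_mutex_lock(&lock);
        if (!entered) {                       /* first thread on this node */
            entered = 1;
            if (rank == 0)                    /* master node executes the body */
                Init = 0;
            MPI_Bcast(&Init, 1, MPI_INT, 0, MPI_COMM_WORLD);
        }
        pthread_mutex_unlock(&lock);
        hybrid_barrier();                     /* end of single construct */
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        pthread_barrier_init(&node_barrier, NULL, NTHREADS);

        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        printf("node %d: Init = %d\n", rank, Init);
        MPI_Finalize();
        return 0;
    }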
34. Performance: ATOMIC Directive
[Figure: execution time (ms) of the ATOMIC directive on 2, 4, 6, and 8 nodes, ParADE vs. HLRC-DSM]
35. Performance: SINGLE Directive
[Figure: execution time (ms) of the SINGLE directive on 2, 4, 6, and 8 nodes, ParADE vs. HLRC-DSM]
36. Experiments
37. Experiments
- 2 Class A NAS kernels
  - CG (64MB)
  - EP (0MB)
- 2 OpenMP applications
  - Wave equation (32MB)
  - Molecular dynamics (1MB)
- Configurations
  - 1Proc-1CPU: uniprocessor kernel, 1 application process per node
  - 1Proc-2CPU: SMP kernel, 1 application process per node
  - 2Proc-2CPU: SMP kernel, 2 application processes per node
38. SMP Cluster for the Experiments
[Figure: eight dual-processor PCs connected by both an Ethernet switch and a VIA switch]
- Nodes: dual Pentium III (600 MHz, 512KB cache), 512MB memory, Red Hat 8.0
- Interconnects: VIA and Ethernet
39. NAS CG (Class A)
40. NAS EP (Class A)
41. Wave Equation Solver
42. Molecular Dynamics
43. Conclusions
44. Conclusions
- ParADE
  - Provides easy programming by adopting the OpenMP model
  - Utilizes an SMP cluster by supporting hybrid execution of message passing and shared address space
  - Demonstrates good scalability
- For more information
  - See the ACM/IEEE Supercomputing 2003 proceedings: "ParADE: An OpenMP-based Programming Environment for SMP Cluster Systems"
45. Discussions
46. Ongoing Project
- Smart translator
  - Exploit data locality
- Runtime system
  - Dynamic load balancing in SVM
- Adaptive computing
- Interdisciplinary applications
47. Smart Translator
- Ratio of computation to communication
  - The major source of load imbalance is page transfer!
- How to exploit data locality?
- Smart translator
  - Identifies remote accesses during the translation process
  - Declares the pages to be pinned down
48. Dynamic Load Balancing in SVM
- Support various loop-scheduling methods
- Workload vs. data locality
49. Adaptive Computing
- Find the best configuration for the given problem
  - How many nodes?
  - Which nodes in a heterogeneous cluster?
  - How many processors in a node?
- How to change the configuration at runtime?
50. Interdisciplinary Applications
- Is ParADE, or any such system, useful?
  - Lack of applications in computer science
  - Real applications tell the truth.
- Understand various scientific applications
  - Extend the horizon of our knowledge!
51. Thank you!