Parallel Programming Environment for SMP Clusters - PowerPoint PPT Presentation

1
Parallel Programming Environment for SMP Clusters
  • Seoul National University
  • Co-Design and Parallel Processing Lab.
  • Yang-Suk Kee

2
Contents
  • High-Performance Computer Architectures
  • Programming Methodologies
  • ParADE Architecture
  • Design Issues & Approaches
  • Experiments
  • Conclusions
  • Discussions

3
High-Performance Computer Architectures
4
Top 500 Supercomputers
[Chart: number of systems in the Top 500 (www.top500.org) by architecture, 1997-2002: SMPs (Symmetric Multiprocessors), Clusters, Constellations, and MPPs (Massively Parallel Processors); y-axis: Number of HPCs, 0-500]
5
Convergence of High-Performance Computer
Architectures
[Diagram: several multiprocessor nodes, each with its own memory and four processors (P), connected by a high-performance interconnection network]
Multicomputer with multiprocessor nodes!
6
SMP Cluster
  • COTS (Commodity Off-the-shelf)
  • Microprocessor
  • High-performance network
  • How to exploit?
  • For high-throughput or high-availability
  • For high-performance computing

7
Programming Methodologies
8
Programming methodologies
  • Architecture
  • A hybrid of message passing and shared address
    space architectures
  • Programming methodologies
  • Pure message passing model
  • Pure shared address space model
  • Hybrid of message passing and shared address
    space programming models

9
Pure Message Passing Model
[Diagram: two multiprocessor nodes joined by a high-performance network through NICs; each node runs four MPI processes, one per processor]
10
Pure Shared Address Space Model
[Diagram: four SVM threads on each of two multiprocessor nodes; threads within a node communicate through hardware shared memory (H/W SM), and a shared virtual memory layer spans both nodes]
11
Hybrid of Message Passing and Shared Address
Space Models
[Diagram: one MPI process per multiprocessor node; within each node, four OpenMP threads run over hardware shared memory (H/W SM), and the MPI processes communicate through NICs over the high-performance network]
12
Comparison between Parallel Programming
Methodologies
                            Pure MP     Pure SAS    Hybrid
Programming cost            Expensive   Cheap       Expensive
Performance on SMP cluster  Good        Poor        Good
Portability                 Good        Poor        Good
Target architecture range   Broad       Broad       Broad
Easy and High-Performance Programming?
13
ParADE Architecture
14
ParADE (Parallel Application Developing
Environment)
  • Motivation
  • Easy and high-performance parallel programming on
    SMP cluster systems
  • Approaches
  • Easy programming
  • Standardized shared address space programming
  • High-performance
  • Hybrid execution of message passing and shared
    address space

15
Why OpenMP?
  • Easy programming
  • Shared address space programming model
  • Portable
  • Standardized programming model
  • Incremental parallelization
  • Directive-based parallelism

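As an illustration of incremental parallelization (my own sketch, not from the slides): a serial loop becomes parallel by adding one directive, and the same source still compiles and runs correctly, just serially, when OpenMP support is disabled.

```c
#include <assert.h>

/* A dot product parallelized with a single OpenMP directive. Without an
 * OpenMP compiler flag the pragma is ignored and the loop runs serially,
 * which is exactly the incremental-parallelization property. */
double dot(const double *x, const double *y, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```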
16
Why Hybrid Execution?
  • Map OpenMP directives to message passing
    primitives
  • Implementation barrier
  • Map OpenMP directives to SVM primitives
  • Poor performance
  • Hybrid execution is intuitive and reasonable.
  • How?

17
OpenMP on SVM
  • OpenMP on TreadMarks
  • By Rice University
  • OpenMP translator
  • Multi-threaded SVM
  • Omni/Scash
  • By RWCP
  • Omni OpenMP compiler
  • Compiler-assisted DSM
  • OpenMP on SVM
  • By Purdue
  • Multi-threaded SVM
  • Under construction

18
How?
  • ParADE OpenMP translator
  • Analyze an OpenMP program and generate a hybrid C
    program of message passing and POSIX threads
  • ParADE runtime system
  • Compile and link the converted program with
    ParADE runtime library

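A minimal sketch of the fork-join code such a translator might emit for a `#pragma omp parallel` region, using POSIX threads only (the MPI side is omitted); the names `parallel_region` and `run_parallel` are hypothetical illustrations, not ParADE's actual translator output.

```c
#include <pthread.h>

#define NTHREADS 2
static int partial[NTHREADS];

/* Each created thread runs the body of the original parallel region. */
static void *parallel_region(void *arg) {
    long id = (long)arg;          /* plays the role of omp_get_thread_num() */
    partial[id] = (int)id + 1;    /* region body */
    return NULL;
}

/* Fork-join: create one thread per processor, join at the implicit
 * barrier that ends the parallel region, then combine the results. */
int run_parallel(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, parallel_region, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return partial[0] + partial[1];
}
```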
19
ParADE Architecture
OpenMP translator
Runtime system
Kernel
20
Design Issues & Approaches
21
Issues in ParADE
  • Beyond OpenMP on SVM
  • Translator
  • Optimizes synchronization and work-sharing
    directives
  • Runtime system
  • Solves atomic page update problem
  • Exploits data locality
  • Simplifies memory consistency protocol

22
First Focus: Atomic Page Update
  • Conventional page-based SVM
  • Based on virtual memory protection mechanism
  • Uses SIGSEGV and SIGIO signals
  • Multiple threads in data race
  • Long page fetch latency

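A minimal sketch of the virtual-memory protection mechanism described above, assuming Linux/POSIX semantics: a page is protected, the first access raises SIGSEGV, and the handler (standing in for the SVM's remote page fetch) re-enables access so the faulting instruction can retry.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;                    /* stands in for one shared page */
static size_t page_size;
static volatile sig_atomic_t faults;  /* SIGSEGV faults taken */

/* A real SVM handler would fetch the page from a remote node here;
 * this sketch only counts the fault and re-enables access. */
static void on_segv(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    faults++;
    mprotect((void *)((uintptr_t)si->si_addr & ~(page_size - 1)),
             page_size, PROT_READ | PROT_WRITE);
}

int fault_demo(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    page[0] = 'x';                    /* writable: no fault */
    mprotect(page, page_size, PROT_NONE);
    page[0] = 'y';                    /* faults once; handler reopens */
    return (int)faults;
}
```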
23
Atomic Page Update Problem
[Diagram: thread T1 on Process 2 reads A and takes a SIGSEGV; its SIGSEGV handler requests the page, and Process 1's SIGIO handler sends it back; the handler on Process 2 then does mmap(A, PROT_WRITE), copies the page in, and finishes with mprotect(A, PROT_READ). While the copy is in flight, thread T2 on Process 2 reads A and gets garbage.]
24
Conventional Solution: File Mapping
Application view: protected address space. System view: freely
accessible address space. Both map the same file:

fd = open(FileName, O_RDWR|O_CREAT, S_IRWXU);
write(fd, zero, Size);
A = mmap(0, Size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FILE, fd, 0);  /* application view */
S = mmap(0, Size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FILE, fd, 0);  /* system view */
mprotect(A, Size, PROT_NONE);
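The file-mapping trick above can be exercised with a short, self-contained sketch (Linux/POSIX assumed; a temporary file stands in for the slide's named file, and MAP_FILE is omitted since plain MAP_SHARED suffices on modern systems):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Two virtual views of one file-backed page: A (application view) is
 * protected, S (system view) stays writable, yet a write through S is
 * visible through A because both map the same physical page. */
int dual_view_demo(void) {
    size_t sz = (size_t)sysconf(_SC_PAGESIZE);
    char tmpl[] = "/tmp/parade_demo_XXXXXX";
    int fd = mkstemp(tmpl);
    unlink(tmpl);                        /* file vanishes on close */
    ftruncate(fd, (off_t)sz);

    char *A = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *S = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    mprotect(A, sz, PROT_NONE);          /* application view protected */
    S[0] = 42;                           /* system view updates freely */
    mprotect(A, sz, PROT_READ);          /* re-expose to the application */
    int v = A[0];
    close(fd);
    return v;                            /* update is visible via A */
}
```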
25
Multiple Paths to A Physical Page
[Diagram: one physical page reachable through two virtual addresses: via the application address space (general memory access, READ or NONE protection) and via the system address space (system memory update, WRITE protection), both translated by the OS kernel to the same physical page]
26
Solution 1: System V Shared Memory
Application view: protected address space. System view: freely
accessible address space. Both attach the same segment:

ID = shmget(IPC_PRIVATE, Size, IPC_CREAT|IPC_EXCL|SHM_R|SHM_W);
A = shmat(ID, 0, 0);   /* application view */
S = shmat(ID, 0, 0);   /* system view */
mprotect(A, Size, PROT_NONE);
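A sketch of the same dual-view idea with System V shared memory, assuming Linux semantics (0600 stands in for SHM_R|SHM_W, and the early IPC_RMID avoids the memory leakage listed as a drawback later in the deck):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <unistd.h>

/* Attach one segment twice: A is the protected application view,
 * S the freely accessible system view of the same physical pages. */
int sysv_view_demo(void) {
    size_t sz = (size_t)sysconf(_SC_PAGESIZE);
    int id = shmget(IPC_PRIVATE, sz, IPC_CREAT | IPC_EXCL | 0600);
    char *A = shmat(id, NULL, 0);
    char *S = shmat(id, NULL, 0);
    shmctl(id, IPC_RMID, NULL);   /* segment destroyed on last detach */

    mprotect(A, sz, PROT_NONE);
    S[0] = 5;                     /* update through the open view */
    mprotect(A, sz, PROT_READ);
    int v = A[0];
    shmdt(S);
    shmdt(A);
    return v;
}
```

Note that mprotect() on a shmat()-ed region works on Linux but is exactly the "restricted use of mprotect()" portability concern the next slide raises.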
27
Solution 2: A New mdup() System Call
A = mmap(0, Size, PROT_NONE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);
S = mdup(A, Size);     /* duplicate mapping of the same physical pages (X, Y, Z) */
mprotect(S, Size, PROT_READ|PROT_WRITE);
28
Solution 3: Child Process Creation
A = mmap(0, Size, PROT_NONE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);
fork();   /* code segment and copy-on-write areas are duplicated;
             the shared memory area (pages X, Y, Z) stays shared */
mprotect(A, Size, PROT_READ|PROT_WRITE);   /* in the child */
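A runnable sketch of this third approach (Linux/POSIX assumed): page protections are per-process while a MAP_SHARED | MAP_ANONYMOUS area is shared across fork(), so the child can open the pages for writing without disturbing the parent's protection.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int fork_share_demo(void) {
    size_t sz = (size_t)sysconf(_SC_PAGESIZE);
    int *A = mmap(NULL, sz, PROT_NONE,
                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pid_t pid = fork();
    if (pid == 0) {
        /* child = "system side": its own page tables, same physical pages */
        mprotect(A, sz, PROT_READ | PROT_WRITE);
        A[0] = 7;
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    mprotect(A, sz, PROT_READ);   /* parent re-exposes the page */
    return A[0];                  /* child's update is visible */
}
```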
29
Pros & Cons
                          File mapping   System V shm   mdup()   Child process
Initialization cost       Expensive      Cheap          Cheap    Cheap
Portability               Good           Restricted     Bad      Good
Address space shrinkage   Yes            Yes            Yes      No

Miscellaneous: file mapping incurs an unnecessary disk write; System V
shared memory risks memory leakage and restricts the use of mprotect();
child process creation adds IPC overhead.
30
Performance: NAS Kernels (A Class)
31
Second Focus: Low-Cost Synchronization Directives
  • Simplify the synchronization mechanism
  • Exploit message passing for synchronization and
    work-sharing directives
  • Work-sharing directive
  • #pragma omp single
  • Synchronization directives
  • #pragma omp critical
  • #pragma omp atomic
  • #pragma omp ... reduction(...)

32
Critical Sections: CRITICAL, ATOMIC, REDUCTION

Source:
    #pragma omp atomic
    sum += i;

DSM translation:
    Acquire(Lock1);
    sum += i;
    Release(Lock1);

ParADE translation:
    pthread_mutex_lock(&lock);
    sum += i;
    if (the last thread)
        MPI_Allreduce(&sum, ..., MPI_SUM, ...);
    pthread_mutex_unlock(&lock);
33
Critical Sections: SINGLE

Source:
    #pragma omp single
    Init = 0;

DSM translation:
    // start of single construct
    Acquire(Lock1);
    if (the first thread)
        Init = 0;
    Release(Lock1);
    Barrier();
    // end of single construct

ParADE translation:
    // start of single construct
    pthread_mutex_lock(&lock);
    if (the first thread) {
        if (the master node)
            Init = 0;
        MPI_Bcast(&Init, ...);
    }
    pthread_mutex_unlock(&lock);
    parade_barrier();
    // end of single construct
34
Performance: ATOMIC Directive
[Chart: execution time (ms, 0.00-2.50) of the ATOMIC directive vs. # of nodes (2, 4, 6, 8), ParADE vs. HLRC-DSM]
35
Performance: SINGLE Directive
[Chart: execution time (ms, 0.00-2.00) of the SINGLE directive vs. # of nodes (2, 4, 6, 8), ParADE vs. HLRC-DSM]
36
Experiments
37
Experiments
  • 2 A-Class NAS Kernels
  • CG (64MB)
  • EP (0MB)
  • 2 OpenMP applications
  • Wave equation (32MB)
  • Molecular dynamics (1MB)
  • Configurations
  • 1Proc-1CPU: uniprocessor kernel, 1 application
    process per node
  • 1Proc-2CPU: SMP kernel, 1 application process
    per node
  • 2Proc-2CPU: SMP kernel, 2 application processes
    per node

38
SMP Cluster for Experiment
[Diagram: 8 PCs, each a dual-Pentium III (600 MHz, 512 KB cache) with 512 MB memory running Red Hat 8.0, connected by both an Ethernet switch and a VIA switch]
39
NAS CG (A-Class)
40
NAS EP (A-Class)
41
Wave Equation Solver
42
Molecular Dynamics
43
Conclusions
44
Conclusions
  • ParADE
  • Provides easy programming by adopting the OpenMP
    model
  • Utilizes an SMP cluster by supporting a hybrid
    execution of message passing and shared address
    space
  • Demonstrates good scalability
  • For more information
  • Refer to the ACM/IEEE Supercomputing 2003
    proceedings
  • ParADE: An OpenMP-based Programming Environment
    for SMP Cluster Systems

45
Discussions
46
Ongoing Project
  • Smart translator
  • Exploit data locality
  • Runtime system
  • Dynamic load-balancing in SVM
  • Adaptive computing
  • Interdisciplinary Applications

47
Smart Translator
  • Ratio of computation to communication
  • The major source of load-imbalance is page
    transfer!
  • How to exploit data locality?
  • Smart translator
  • Identifies the remote accesses during the
    translation process
  • Declares the pages to be pinned down

48
Dynamic Load-Balancing in SVM
  • Support various loop scheduling methods
  • Workload vs. data locality

49
Adaptive Computing
  • Find out the best configuration fit for the given
    problem
  • How many nodes?
  • Which nodes in a heterogeneous cluster?
  • How many processors in a node?
  • How to change the configuration at runtime?

50
Interdisciplinary Applications
  • Is ParADE or any system useful?
  • Lack of applications in computer science
  • Real applications tell the truth.
  • Understand various scientific applications
  • Extend the horizon of our knowledge!

51
Thank you!
  • Any Questions?